194 changes: 194 additions & 0 deletions README.md
@@ -973,6 +973,200 @@ system_prompt: |
information when asked but expect efficient service.
```

## 🎯 Goal Evaluation Modes

ReplicantX provides intelligent goal evaluation to determine accurately when conversation objectives have been achieved, avoiding the false positives that plague simple keyword matching.

### The Problem with Keywords

Traditional keyword-based completion detection can produce false positives:

```yaml
# Problematic scenario
completion_keywords: ["confirmed", "booked"]

# False positive examples:
# ❌ "I'll let you know when your booking has been confirmed" (contains "confirmed")
# ❌ "Have you booked with us before?" (contains "booked")
# ❌ "Your booking confirmation is pending" (contains "booking")
```
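
Under the hood, keyword matching is just a substring check, so any response that merely mentions a keyword counts as success. A minimal sketch of the failure mode (the helper name is hypothetical; the actual ReplicantX implementation may differ):

```python
def keywords_achieved(response: str, completion_keywords: list[str]) -> bool:
    """Naive completion check: True if any keyword appears as a substring."""
    text = response.lower()
    return any(keyword.lower() in text for keyword in completion_keywords)

# A promise, not an accomplishment, yet it still matches:
print(keywords_achieved(
    "I'll let you know when your booking has been confirmed",
    ["confirmed", "booked"],
))  # True (false positive)
```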

### Three Evaluation Modes

#### 1. **Keywords Mode** (Default - Backwards Compatible)
Simple substring matching - the original behavior:

```yaml
replicant:
goal: "Book a flight to Paris"
goal_evaluation_mode: "keywords" # Default
completion_keywords: ["confirmed", "booked", "reservation number"]
```

**Use when:**
- ✅ Maintaining existing test compatibility
- ✅ Simple scenarios with clear completion signals
- ✅ Performance is critical (no LLM calls)

#### 2. **Intelligent Mode** (Recommended)
LLM-powered analysis that understands context and intent (see the sketch after the benefits list):

```yaml
replicant:
goal: "Book a business class flight to Paris"
goal_evaluation_mode: "intelligent"
goal_evaluation_model: "openai:gpt-4o-mini" # Optional: separate model for evaluation
completion_keywords: ["confirmed", "booked"] # Still required for compatibility
```

**Benefits:**
- ✅ **Context-aware**: Distinguishes promises from accomplishments
- ✅ **False positive reduction**: "I'll confirm later" ≠ "Your booking is confirmed"
- ✅ **Intent understanding**: Recognizes goal completion without exact keywords
- ✅ **Reasoning provided**: Detailed explanation of evaluation decisions
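
Conceptually, intelligent mode renders the goal, the user facts, and the recent conversation into an evaluation prompt (the same `{goal}`, `{facts}`, and `{conversation}` placeholders used by custom prompts below) and asks the evaluation model for a structured verdict. A rough sketch, with the template wording and helper name as illustrative assumptions rather than the actual implementation:

```python
DEFAULT_EVALUATION_TEMPLATE = """\
Evaluate whether the user's goal has been achieved in this conversation.
Distinguish promises ("I'll confirm later") from accomplishments
("Your booking is confirmed").

Goal: {goal}
User Facts: {facts}
Recent Conversation: {conversation}

Respond exactly:
RESULT: [ACHIEVED or NOT_ACHIEVED]
CONFIDENCE: [0.0 to 1.0]
REASONING: [Brief explanation]
"""


def build_evaluation_prompt(goal: str, facts: dict, conversation: list[str]) -> str:
    """Fill the template the same way a custom goal_evaluation_prompt is filled."""
    return DEFAULT_EVALUATION_TEMPLATE.format(
        goal=goal,
        facts=facts,
        conversation="\n".join(conversation[-6:]),  # only the last few turns
    )
```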

#### 3. **Hybrid Mode** (Best of Both Worlds)
Attempts LLM evaluation first and falls back to keywords if the LLM is uncertain (see the sketch after the benefits list):

```yaml
replicant:
goal: "Get help with billing issue"
goal_evaluation_mode: "hybrid"
goal_evaluation_model: "openai:gpt-4o-mini"
completion_keywords: ["resolved", "ticket created", "issue closed"]
```

**Benefits:**
- ✅ **Smart evaluation** when LLM is confident
- ✅ **Reliable fallback** when LLM is uncertain
- ✅ **Cost-effective** for mixed scenarios
- ✅ **Production-ready** with built-in safety net
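
In other words, hybrid mode trusts the LLM verdict when its confidence is high and otherwise falls back to the deterministic keyword check, recording the fallback in the result (see `fallback_used` on `GoalEvaluationResult` in the models.py diff below). A sketch under assumed names; the confidence threshold and helper signature are illustrative, not the actual implementation:

```python
async def evaluate_hybrid(
    response: str,
    keywords: list[str],
    llm_evaluate,  # async callable returning (achieved: bool, confidence: float, reasoning: str)
    confidence_threshold: float = 0.7,  # assumed cutoff, not a documented value
) -> dict:
    """LLM first; keyword fallback when the LLM is uncertain."""
    achieved, confidence, reasoning = await llm_evaluate(response)
    if confidence >= confidence_threshold:
        return {
            "goal_achieved": achieved,
            "confidence": confidence,
            "reasoning": reasoning,
            "evaluation_method": "hybrid",
            "fallback_used": False,
        }
    # LLM was uncertain: fall back to the substring check.
    matched = any(k.lower() in response.lower() for k in keywords)
    return {
        "goal_achieved": matched,
        "confidence": confidence,
        "reasoning": f"LLM uncertain ({reasoning}); keyword fallback applied",
        "evaluation_method": "hybrid",
        "fallback_used": True,
    }
```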

### Custom Evaluation Prompts

For domain-specific scenarios, customize the evaluation logic:

```yaml
replicant:
goal: "Complete a customer support ticket"
goal_evaluation_mode: "intelligent"
goal_evaluation_prompt: |
Evaluate if the customer support goal is achieved. Look for:
1. Issue resolution confirmation from the agent
2. Ticket number or reference provided
3. Customer satisfaction or acknowledgment
4. Clear closure statements

Goal: {goal}
User Facts: {facts}
Recent Conversation: {conversation}

Respond exactly:
RESULT: [ACHIEVED or NOT_ACHIEVED]
CONFIDENCE: [0.0 to 1.0]
REASONING: [Brief explanation]
completion_keywords: ["resolved", "ticket created"]
```
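
Whatever wording you choose, keep the three response lines intact: the evaluator has to parse `RESULT:`, `CONFIDENCE:`, and `REASONING:` back into a structured result. A plausible parser sketch (hypothetical, shown only to illustrate why the format is strict):

```python
import re


def parse_evaluation_response(raw: str) -> dict:
    """Extract the RESULT/CONFIDENCE/REASONING lines the prompt demands."""
    result = re.search(r"RESULT:\s*(ACHIEVED|NOT_ACHIEVED)", raw)
    confidence = re.search(r"CONFIDENCE:\s*([01](?:\.\d+)?)", raw)
    reasoning = re.search(r"REASONING:\s*(.+)", raw)
    if not (result and confidence and reasoning):
        raise ValueError(f"Malformed evaluation response: {raw!r}")
    return {
        "goal_achieved": result.group(1) == "ACHIEVED",
        "confidence": float(confidence.group(1)),
        "reasoning": reasoning.group(1).strip(),
    }
```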

### Example: Flight Booking with Intelligent Evaluation

```yaml
name: "Smart Flight Booking Test"
base_url: "https://api.example.com/chat"
auth:
provider: noop
level: agent
replicant:
goal: "Book a round-trip business class flight to Paris"
facts:
name: "Sarah Johnson"
email: "sarah@example.com"
travel_class: "business"
destination: "Paris"
departure_city: "New York"
travel_date: "next Friday"
return_date: "following Monday"
budget: "$3000"
system_prompt: |
You are a customer booking a flight. Provide information when asked
but don't volunteer everything upfront. Be conversational and natural.
initial_message: "Hi, I'd like to book a flight to Paris."
max_turns: 15

# Intelligent goal evaluation
goal_evaluation_mode: "intelligent"
goal_evaluation_model: "openai:gpt-4o-mini" # Fast, cost-effective model

# Still needed for fallback/compatibility
completion_keywords: ["booked", "confirmed", "reservation number"]

llm:
model: "openai:gpt-4o"
temperature: 0.7
max_tokens: 150
```
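
One detail worth noting in this config: `goal_evaluation_model` is optional, and per the field description in `replicantx/models.py` it defaults to the main LLM model when unset. The resolution amounts to:

```python
def resolve_evaluation_model(goal_evaluation_model: str | None, llm_model: str) -> str:
    """Mirror the documented default: fall back to the main LLM model."""
    return goal_evaluation_model or llm_model


# With the config above: a cheaper model evaluates, gpt-4o converses.
assert resolve_evaluation_model("openai:gpt-4o-mini", "openai:gpt-4o") == "openai:gpt-4o-mini"
assert resolve_evaluation_model(None, "openai:gpt-4o") == "openai:gpt-4o"
```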

### Evaluation Results in Reports

The watch mode now shows detailed evaluation information:

```bash
📊 CONVERSATION COMPLETE
🏁 Status: ✅ SUCCESS
🎯 Goal achieved: Yes
🧠 Evaluation method: intelligent
📊 Confidence: 0.89
💭 Reasoning: The flight has been successfully booked with confirmation number ABC123 provided
```

### Migration Strategy

**Phase 1: Test Intelligent Mode**
```yaml
# Update specific tests to use intelligent evaluation
goal_evaluation_mode: "intelligent"
```

**Phase 2: Adopt Hybrid Mode**
```yaml
# Use hybrid for safety while gaining intelligence
goal_evaluation_mode: "hybrid"
```

**Phase 3: Gradual Rollout**
```yaml
# Eventually make intelligent/hybrid the default for new tests
goal_evaluation_mode: "intelligent"
```

### When to Use Each Mode

| Mode | Use Case | Pros | Cons |
|------|----------|------|------|
| **keywords** | Legacy tests, simple APIs | Fast, deterministic, no LLM calls | Prone to false positives |
| **intelligent** | Modern apps, complex goals | Accurate, context-aware | Requires LLM calls (cost, latency) |
| **hybrid** | Production, mixed scenarios | Smart evaluation with a safe fallback | Slightly more complex |

**Recommendation**: Start with `hybrid` mode for new tests to get the benefits of intelligent evaluation with keyword fallback safety.

### 🧪 Try the Example

See a complete example that demonstrates false positive prevention:

```bash
# Download the example test
curl -O https://raw.githubusercontent.com/helixtechnologies/replicantx/main/tests/intelligent_evaluation_example.yaml

# Run with intelligent evaluation
replicantx run intelligent_evaluation_example.yaml --watch

# Compare with keyword-only mode by changing goal_evaluation_mode to "keywords"
```

This example shows how intelligent evaluation distinguishes between:
- ❌ "I'll create a ticket for your issue" (promise)
- ✅ "Your refund has been processed, reference #REF123" (completion)

## 🧠 LLM Integration

ReplicantX uses **PydanticAI** for powerful LLM integration with multiple providers:
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "replicantx"
version = "0.1.4"
version = "0.1.5"
description = "End-to-end testing harness for AI agents via web service API"
readme = "README.md"
requires-python = ">=3.11"
24 changes: 24 additions & 0 deletions replicantx/models.py
@@ -65,6 +65,13 @@ class SessionPlacement(str, Enum):
URL = "url" # In URL path (RESTful)


class GoalEvaluationMode(str, Enum):
"""Goal evaluation modes."""
KEYWORDS = "keywords" # Simple keyword matching (legacy behavior)
INTELLIGENT = "intelligent" # LLM-based goal evaluation
HYBRID = "hybrid" # LLM with keyword fallback


class LLMConfig(BaseModel):
"""Configuration for LLM using PydanticAI models."""
model_config = ConfigDict(extra="forbid")
@@ -90,6 +97,18 @@ class Message(BaseModel):
metadata: Dict[str, Any] = Field(default_factory=dict, description="Additional metadata")


class GoalEvaluationResult(BaseModel):
"""Result of goal evaluation."""
model_config = ConfigDict(extra="forbid")

goal_achieved: bool = Field(..., description="Whether the goal has been achieved")
confidence: float = Field(..., description="Confidence score from 0.0 to 1.0")
reasoning: str = Field(..., description="Explanation of why the goal is/isn't achieved")
evaluation_method: str = Field(..., description="Method used: 'keywords', 'intelligent', or 'hybrid'")
fallback_used: bool = Field(False, description="Whether hybrid mode fell back to keywords")
timestamp: datetime = Field(default_factory=datetime.now, description="When evaluation was performed")


class AssertionResult(BaseModel):
"""Result of an assertion check."""
model_config = ConfigDict(extra="forbid")
@@ -195,6 +214,11 @@ class ReplicantConfig(BaseModel):
session_placement: SessionPlacement = Field(SessionPlacement.BODY, description="Session ID placement: 'header', 'body', or 'url' (default: body)")
session_variable_name: str = Field("session_id", description="Name of the session variable in header/body (default: session_id)")
llm: LLMConfig = Field(default_factory=LLMConfig, description="LLM configuration for response generation")

# Goal evaluation configuration
goal_evaluation_mode: GoalEvaluationMode = Field(GoalEvaluationMode.KEYWORDS, description="Goal evaluation mode: 'keywords' (default), 'intelligent', or 'hybrid'")
goal_evaluation_model: Optional[str] = Field(None, description="PydanticAI model for goal evaluation (defaults to main LLM model if not specified)")
goal_evaluation_prompt: Optional[str] = Field(None, description="Custom prompt for goal evaluation (uses default if not specified)")


class ScenarioConfig(BaseModel):
38 changes: 27 additions & 11 deletions replicantx/scenarios/agent.py
@@ -149,7 +149,13 @@ async def run(self) -> ScenarioReport:
# Initialize Replicant agent
self.replicant_agent = ReplicantAgent.create(self.config.replicant)

current_datetime = datetime.now()
date_str = current_datetime.strftime("%A, %B %d, %Y")
time_str = current_datetime.strftime("%I:%M %p %Z")

self._debug_log("Replicant Agent initialized", {
"current_date": date_str,
"current_time": time_str,
"goal": self.config.replicant.goal,
"facts_count": len(self.config.replicant.facts),
"facts": str(self.config.replicant.facts),
@@ -174,7 +180,13 @@ async def run(self) -> ScenarioReport:

# Initialize watch mode
if self.watch:
current_datetime = datetime.now()
date_str = current_datetime.strftime("%A, %B %d, %Y")
time_str = current_datetime.strftime("%I:%M %p %Z")

self._watch_log("👥 [bold green]LIVE CONVERSATION[/bold green] - Starting agent scenario")
self._watch_log(f"📅 Date: {date_str}")
self._watch_log(f"🕐 Time: {time_str}")
self._watch_log(f"🎯 Goal: {self.config.replicant.goal}")
self._watch_log(f"📝 Facts: {len(self.config.replicant.facts)} items available")
self._watch_log("")
@@ -191,17 +203,8 @@ async def run(self) -> ScenarioReport:

self._watch_log(f"👤 [bold cyan]User:[/bold cyan] {current_message}")

# Record initial message in conversation history
from ..models import Message
initial_message = Message(
role="user",
content=current_message,
timestamp=datetime.now()
)
self.replicant_agent.state.conversation_history.append(initial_message)

# Continue conversation until completion or limits reached
while self.replicant_agent.should_continue_conversation():
while await self.replicant_agent.should_continue_conversation():
self._debug_log(f"Executing conversation step {step_index + 1}", {
"user_message": current_message,
"turn_count": self.replicant_agent.state.turn_count,
@@ -254,7 +257,9 @@ async def run(self) -> ScenarioReport:
"parsed_response": parsed_response
})

current_message = await self.replicant_agent.process_api_response(parsed_response)
# For the first response, pass the triggering message to add to conversation history
triggering_message = current_message if step_index == 0 else None
current_message = await self.replicant_agent.process_api_response(parsed_response, triggering_message)

self._debug_log("Generated next user message", {
"next_message": current_message,
Expand Down Expand Up @@ -297,6 +302,17 @@ async def run(self) -> ScenarioReport:
self._watch_log(f"🎯 Goal achieved: {'Yes' if conversation_summary.get('goal_achieved', False) else 'No'}")
self._watch_log(f"📝 Facts used: {conversation_summary.get('facts_used', 0)}")
self._watch_log(f"💬 Total turns: {conversation_summary.get('total_turns', 0)}")

# Add goal evaluation details if available
if 'goal_evaluation_method' in conversation_summary:
method = conversation_summary.get('goal_evaluation_method', 'unknown')
confidence = conversation_summary.get('goal_evaluation_confidence', 0.0)
fallback = conversation_summary.get('goal_evaluation_fallback_used', False)
reasoning = conversation_summary.get('goal_evaluation_reasoning', 'No reasoning provided')

self._watch_log(f"🧠 Evaluation method: {method}" + (" (fallback used)" if fallback else ""))
self._watch_log(f"📊 Confidence: {confidence:.2f}")
self._watch_log(f"💭 Reasoning: {reasoning}")

self._debug_log("Scenario completed successfully", {
"passed": report.passed,