194 changes: 194 additions & 0 deletions README.md
@@ -973,6 +973,200 @@ system_prompt: |
information when asked but expect efficient service.
```

## 🎯 Goal Evaluation Modes

ReplicantX provides intelligent goal evaluation to determine accurately when conversation objectives have been achieved, avoiding the false positives that plague simple keyword matching.

### The Problem with Keywords

Traditional keyword-based completion detection can produce false positives:

```yaml
# Problematic scenario
completion_keywords: ["confirmed", "booked"]

# False positive examples:
# ❌ "I'll let you know when your booking has been confirmed" (contains "confirmed")
# ❌ "Have you booked with us before?" (contains "booked")
# ❌ "Your booking confirmation is pending" (contains "booking")
```
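
Under the hood, keyword matching is just a substring check, so any response that merely mentions a keyword counts as success. A minimal sketch of the failure mode (the helper name is hypothetical; the actual ReplicantX implementation may differ):

```python
def keywords_achieved(response: str, completion_keywords: list[str]) -> bool:
    """Naive completion check: True if any keyword appears as a substring."""
    text = response.lower()
    return any(keyword.lower() in text for keyword in completion_keywords)

# A promise, not an accomplishment, yet it still matches:
print(keywords_achieved(
    "I'll let you know when your booking has been confirmed",
    ["confirmed", "booked"],
))  # True (false positive)
```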

### Three Evaluation Modes

#### 1. **Keywords Mode** (Default - Backwards Compatible)
Simple substring matching - the original behavior:

```yaml
replicant:
goal: "Book a flight to Paris"
goal_evaluation_mode: "keywords" # Default
completion_keywords: ["confirmed", "booked", "reservation number"]
```

**Use when:**
- ✅ Maintaining existing test compatibility
- ✅ Simple scenarios with clear completion signals
- ✅ Performance is critical (no LLM calls)

#### 2. **Intelligent Mode** (Recommended)
LLM-powered analysis that understands context and intent (see the sketch after the benefits list):

```yaml
replicant:
goal: "Book a business class flight to Paris"
goal_evaluation_mode: "intelligent"
goal_evaluation_model: "openai:gpt-4o-mini" # Optional: separate model for evaluation
completion_keywords: ["confirmed", "booked"] # Still required for compatibility
```

**Benefits:**
- ✅ **Context-aware**: Distinguishes promises from accomplishments
- ✅ **False positive reduction**: "I'll confirm later" ≠ "Your booking is confirmed"
- ✅ **Intent understanding**: Recognizes goal completion without exact keywords
- ✅ **Reasoning provided**: Detailed explanation of evaluation decisions
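
Conceptually, intelligent mode renders the goal, the user facts, and the recent conversation into an evaluation prompt (the same `{goal}`, `{facts}`, and `{conversation}` placeholders used by custom prompts below) and asks the evaluation model for a structured verdict. A rough sketch, with the template wording and helper name as illustrative assumptions rather than the actual implementation:

```python
DEFAULT_EVALUATION_TEMPLATE = """\
Evaluate whether the user's goal has been achieved in this conversation.
Distinguish promises ("I'll confirm later") from accomplishments
("Your booking is confirmed").

Goal: {goal}
User Facts: {facts}
Recent Conversation: {conversation}

Respond exactly:
RESULT: [ACHIEVED or NOT_ACHIEVED]
CONFIDENCE: [0.0 to 1.0]
REASONING: [Brief explanation]
"""


def build_evaluation_prompt(goal: str, facts: dict, conversation: list[str]) -> str:
    """Fill the template the same way a custom goal_evaluation_prompt is filled."""
    return DEFAULT_EVALUATION_TEMPLATE.format(
        goal=goal,
        facts=facts,
        conversation="\n".join(conversation[-6:]),  # only the last few turns
    )
```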

#### 3. **Hybrid Mode** (Best of Both Worlds)
Attempts LLM evaluation first and falls back to keywords if the LLM is uncertain (see the sketch after the benefits list):

```yaml
replicant:
goal: "Get help with billing issue"
goal_evaluation_mode: "hybrid"
goal_evaluation_model: "openai:gpt-4o-mini"
completion_keywords: ["resolved", "ticket created", "issue closed"]
```

**Benefits:**
- ✅ **Smart evaluation** when LLM is confident
- ✅ **Reliable fallback** when LLM is uncertain
- ✅ **Cost-effective** for mixed scenarios
- ✅ **Production-ready** with built-in safety net
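
In other words, hybrid mode trusts the LLM verdict when its confidence is high and otherwise falls back to the deterministic keyword check, recording the fallback in the result (see `fallback_used` on `GoalEvaluationResult` in the models.py diff below). A sketch under assumed names; the confidence threshold and helper signature are illustrative, not the actual implementation:

```python
async def evaluate_hybrid(
    response: str,
    keywords: list[str],
    llm_evaluate,  # async callable returning (achieved: bool, confidence: float, reasoning: str)
    confidence_threshold: float = 0.7,  # assumed cutoff, not a documented value
) -> dict:
    """LLM first; keyword fallback when the LLM is uncertain."""
    achieved, confidence, reasoning = await llm_evaluate(response)
    if confidence >= confidence_threshold:
        return {
            "goal_achieved": achieved,
            "confidence": confidence,
            "reasoning": reasoning,
            "evaluation_method": "hybrid",
            "fallback_used": False,
        }
    # LLM was uncertain: fall back to the substring check.
    matched = any(k.lower() in response.lower() for k in keywords)
    return {
        "goal_achieved": matched,
        "confidence": confidence,
        "reasoning": f"LLM uncertain ({reasoning}); keyword fallback applied",
        "evaluation_method": "hybrid",
        "fallback_used": True,
    }
```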

### Custom Evaluation Prompts

For domain-specific scenarios, customize the evaluation logic:

```yaml
replicant:
goal: "Complete a customer support ticket"
goal_evaluation_mode: "intelligent"
goal_evaluation_prompt: |
Evaluate if the customer support goal is achieved. Look for:
1. Issue resolution confirmation from the agent
2. Ticket number or reference provided
3. Customer satisfaction or acknowledgment
4. Clear closure statements

Goal: {goal}
User Facts: {facts}
Recent Conversation: {conversation}

Respond exactly:
RESULT: [ACHIEVED or NOT_ACHIEVED]
CONFIDENCE: [0.0 to 1.0]
REASONING: [Brief explanation]
completion_keywords: ["resolved", "ticket created"]
```
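
Whatever wording you choose, keep the three response lines intact: the evaluator has to parse `RESULT:`, `CONFIDENCE:`, and `REASONING:` back into a structured result. A plausible parser sketch (hypothetical, shown only to illustrate why the format is strict):

```python
import re


def parse_evaluation_response(raw: str) -> dict:
    """Extract the RESULT/CONFIDENCE/REASONING lines the prompt demands."""
    result = re.search(r"RESULT:\s*(ACHIEVED|NOT_ACHIEVED)", raw)
    confidence = re.search(r"CONFIDENCE:\s*([01](?:\.\d+)?)", raw)
    reasoning = re.search(r"REASONING:\s*(.+)", raw)
    if not (result and confidence and reasoning):
        raise ValueError(f"Malformed evaluation response: {raw!r}")
    return {
        "goal_achieved": result.group(1) == "ACHIEVED",
        "confidence": float(confidence.group(1)),
        "reasoning": reasoning.group(1).strip(),
    }
```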

### Example: Flight Booking with Intelligent Evaluation

```yaml
name: "Smart Flight Booking Test"
base_url: "https://api.example.com/chat"
auth:
provider: noop
level: agent
replicant:
goal: "Book a round-trip business class flight to Paris"
facts:
name: "Sarah Johnson"
email: "sarah@example.com"
travel_class: "business"
destination: "Paris"
departure_city: "New York"
travel_date: "next Friday"
return_date: "following Monday"
budget: "$3000"
system_prompt: |
You are a customer booking a flight. Provide information when asked
but don't volunteer everything upfront. Be conversational and natural.
initial_message: "Hi, I'd like to book a flight to Paris."
max_turns: 15

# Intelligent goal evaluation
goal_evaluation_mode: "intelligent"
goal_evaluation_model: "openai:gpt-4o-mini" # Fast, cost-effective model

# Still needed for fallback/compatibility
completion_keywords: ["booked", "confirmed", "reservation number"]

llm:
model: "openai:gpt-4o"
temperature: 0.7
max_tokens: 150
```
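
One detail worth noting in this config: `goal_evaluation_model` is optional, and per the field description in `replicantx/models.py` it defaults to the main LLM model when unset. The resolution amounts to:

```python
def resolve_evaluation_model(goal_evaluation_model: str | None, llm_model: str) -> str:
    """Mirror the documented default: fall back to the main LLM model."""
    return goal_evaluation_model or llm_model


# With the config above: a cheaper model evaluates, gpt-4o converses.
assert resolve_evaluation_model("openai:gpt-4o-mini", "openai:gpt-4o") == "openai:gpt-4o-mini"
assert resolve_evaluation_model(None, "openai:gpt-4o") == "openai:gpt-4o"
```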

### Evaluation Results in Reports

The watch mode now shows detailed evaluation information:

```bash
📊 CONVERSATION COMPLETE
🏁 Status: ✅ SUCCESS
🎯 Goal achieved: Yes
🧠 Evaluation method: intelligent
📊 Confidence: 0.89
💭 Reasoning: The flight has been successfully booked with confirmation number ABC123 provided
```

### Migration Strategy

**Phase 1: Test Intelligent Mode**
```yaml
# Update specific tests to use intelligent evaluation
goal_evaluation_mode: "intelligent"
```

**Phase 2: Adopt Hybrid Mode**
```yaml
# Use hybrid for safety while gaining intelligence
goal_evaluation_mode: "hybrid"
```

**Phase 3: Gradual Rollout**
```yaml
# Eventually make intelligent/hybrid the default for new tests
goal_evaluation_mode: "intelligent"
```

### When to Use Each Mode

| Mode | Use Case | Pros | Cons |
|------|----------|------|------|
| **keywords** | Legacy tests, simple APIs | Fast, deterministic, no LLM calls | Prone to false positives |
| **intelligent** | Modern apps, complex goals | Accurate, context-aware | Requires LLM calls (cost, latency) |
| **hybrid** | Production, mixed scenarios | Smart evaluation with a safe fallback | Slightly more complex |

**Recommendation**: Start with `hybrid` mode for new tests to get the benefits of intelligent evaluation with keyword fallback safety.

### 🧪 Try the Example

See a complete example that demonstrates false positive prevention:

```bash
# Download the example test
curl -O https://raw.githubusercontent.com/helixtechnologies/replicantx/main/tests/intelligent_evaluation_example.yaml

# Run with intelligent evaluation
replicantx run intelligent_evaluation_example.yaml --watch

# Compare with keyword-only mode by changing goal_evaluation_mode to "keywords"
```

This example shows how intelligent evaluation distinguishes between:
- ❌ "I'll create a ticket for your issue" (promise)
- ✅ "Your refund has been processed, reference #REF123" (completion)

## 🧠 LLM Integration

ReplicantX uses **PydanticAI** for powerful LLM integration with multiple providers:
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "replicantx"
version = "0.1.4"
version = "0.1.5"
description = "End-to-end testing harness for AI agents via web service API"
readme = "README.md"
requires-python = ">=3.11"
24 changes: 24 additions & 0 deletions replicantx/models.py
@@ -65,6 +65,13 @@ class SessionPlacement(str, Enum):
URL = "url" # In URL path (RESTful)


class GoalEvaluationMode(str, Enum):
"""Goal evaluation modes."""
KEYWORDS = "keywords" # Simple keyword matching (legacy behavior)
INTELLIGENT = "intelligent" # LLM-based goal evaluation
HYBRID = "hybrid" # LLM with keyword fallback


class LLMConfig(BaseModel):
"""Configuration for LLM using PydanticAI models."""
model_config = ConfigDict(extra="forbid")
@@ -90,6 +97,18 @@ class Message(BaseModel):
metadata: Dict[str, Any] = Field(default_factory=dict, description="Additional metadata")


class GoalEvaluationResult(BaseModel):
"""Result of goal evaluation."""
model_config = ConfigDict(extra="forbid")

goal_achieved: bool = Field(..., description="Whether the goal has been achieved")
confidence: float = Field(..., description="Confidence score from 0.0 to 1.0")
reasoning: str = Field(..., description="Explanation of why the goal is/isn't achieved")
evaluation_method: str = Field(..., description="Method used: 'keywords', 'intelligent', or 'hybrid'")
fallback_used: bool = Field(False, description="Whether hybrid mode fell back to keywords")
timestamp: datetime = Field(default_factory=datetime.now, description="When evaluation was performed")


class AssertionResult(BaseModel):
"""Result of an assertion check."""
model_config = ConfigDict(extra="forbid")
@@ -195,6 +214,11 @@ class ReplicantConfig(BaseModel):
session_placement: SessionPlacement = Field(SessionPlacement.BODY, description="Session ID placement: 'header', 'body', or 'url' (default: body)")
session_variable_name: str = Field("session_id", description="Name of the session variable in header/body (default: session_id)")
llm: LLMConfig = Field(default_factory=LLMConfig, description="LLM configuration for response generation")

# Goal evaluation configuration
goal_evaluation_mode: GoalEvaluationMode = Field(GoalEvaluationMode.KEYWORDS, description="Goal evaluation mode: 'keywords' (default), 'intelligent', or 'hybrid'")
goal_evaluation_model: Optional[str] = Field(None, description="PydanticAI model for goal evaluation (defaults to main LLM model if not specified)")
goal_evaluation_prompt: Optional[str] = Field(None, description="Custom prompt for goal evaluation (uses default if not specified)")


class ScenarioConfig(BaseModel):
38 changes: 27 additions & 11 deletions replicantx/scenarios/agent.py
@@ -149,7 +149,13 @@ async def run(self) -> ScenarioReport:
# Initialize Replicant agent
self.replicant_agent = ReplicantAgent.create(self.config.replicant)

current_datetime = datetime.now()
date_str = current_datetime.strftime("%A, %B %d, %Y")
time_str = current_datetime.strftime("%I:%M %p %Z")

self._debug_log("Replicant Agent initialized", {
"current_date": date_str,
"current_time": time_str,
"goal": self.config.replicant.goal,
"facts_count": len(self.config.replicant.facts),
"facts": str(self.config.replicant.facts),
@@ -174,7 +180,13 @@ async def run(self) -> ScenarioReport:

# Initialize watch mode
if self.watch:
current_datetime = datetime.now()
date_str = current_datetime.strftime("%A, %B %d, %Y")
time_str = current_datetime.strftime("%I:%M %p %Z")

self._watch_log("👥 [bold green]LIVE CONVERSATION[/bold green] - Starting agent scenario")
self._watch_log(f"📅 Date: {date_str}")
self._watch_log(f"🕐 Time: {time_str}")
self._watch_log(f"🎯 Goal: {self.config.replicant.goal}")
self._watch_log(f"📝 Facts: {len(self.config.replicant.facts)} items available")
self._watch_log("")
@@ -191,17 +203,8 @@ async def run(self) -> ScenarioReport:

self._watch_log(f"👤 [bold cyan]User:[/bold cyan] {current_message}")

# Record initial message in conversation history
from ..models import Message
initial_message = Message(
role="user",
content=current_message,
timestamp=datetime.now()
)
self.replicant_agent.state.conversation_history.append(initial_message)

# Continue conversation until completion or limits reached
while self.replicant_agent.should_continue_conversation():
while await self.replicant_agent.should_continue_conversation():
self._debug_log(f"Executing conversation step {step_index + 1}", {
"user_message": current_message,
"turn_count": self.replicant_agent.state.turn_count,
@@ -254,7 +257,9 @@ async def run(self) -> ScenarioReport:
"parsed_response": parsed_response
})

current_message = await self.replicant_agent.process_api_response(parsed_response)
# For the first response, pass the triggering message to add to conversation history
triggering_message = current_message if step_index == 0 else None
current_message = await self.replicant_agent.process_api_response(parsed_response, triggering_message)

self._debug_log("Generated next user message", {
"next_message": current_message,
Expand Down Expand Up @@ -297,6 +302,17 @@ async def run(self) -> ScenarioReport:
self._watch_log(f"🎯 Goal achieved: {'Yes' if conversation_summary.get('goal_achieved', False) else 'No'}")
self._watch_log(f"📝 Facts used: {conversation_summary.get('facts_used', 0)}")
self._watch_log(f"💬 Total turns: {conversation_summary.get('total_turns', 0)}")

# Add goal evaluation details if available
if 'goal_evaluation_method' in conversation_summary:
method = conversation_summary.get('goal_evaluation_method', 'unknown')
confidence = conversation_summary.get('goal_evaluation_confidence', 0.0)
fallback = conversation_summary.get('goal_evaluation_fallback_used', False)
reasoning = conversation_summary.get('goal_evaluation_reasoning', 'No reasoning provided')

self._watch_log(f"🧠 Evaluation method: {method}" + (" (fallback used)" if fallback else ""))
self._watch_log(f"📊 Confidence: {confidence:.2f}")
self._watch_log(f"💭 Reasoning: {reasoning}")

self._debug_log("Scenario completed successfully", {
"passed": report.passed,