1,015 changes: 1,015 additions & 0 deletions hackathon 2/INTEGRATION_GUIDE.md

Large diffs are not rendered by default.

173 changes: 173 additions & 0 deletions hackathon 2/docs/plans/2026-01-16-extraction-design.md
# Freight Agent - Step 1+2: Extraction Design

**Date:** 2026-01-16
**Status:** Ready to implement

---

## Overview

Build a GPT-powered extraction step that reads freight quote request emails and outputs structured data.

**Approach:** Use OpenAI GPT to parse emails into a structured schema. Keep the extracted data raw (no normalization); fuzzy matching happens in later steps.

---

## Input

Raw email JSON from `hackathon_data/emails/`:

```json
{
  "from": "sarah.chen@globalimports.com",
  "to": "quotes@freightco.com",
  "subject": "Quote Request: Shanghai to Rotterdam",
  "body": "Hi,\n\nWe need a quote for:\n\nOrigin: Shanghai\nDestination: Rotterdam\nContainer: 2 x 40ft\nCommodity: Electronics\n\nPlease send your best rate.\n\nThanks,\nSarah"
}
```

---

## Output Schema

```python
{
    "sender_email": str,          # From email "from" field - needed for SOP lookup

    "shipments": [                # Array - emails can have multiple routes (e.g., email_06)
        {
            "mode": "sea" | "air" | null,   # Inferred from context

            # Location (raw - no normalization yet)
            "origin_raw": str | null,       # "HCMC (Saigon)", "ningbo", etc.
            "destination_raw": str | null,  # "Tokyo Narita", "felixstowe", etc.

            # Sea freight specific
            "container_size_ft": 20 | 40 | null,
            "quantity": int | null,         # Number of containers

            # Air freight specific
            "actual_weight_kg": float | null,
            "volume_cbm": float | null,

            # Optional
            "commodity": str | null
        }
    ],

    "missing_fields": list[str],  # ["origin city", "container size", "mode"]
    "needs_clarification": bool   # True if we can't quote without more info
}
```

---

## Mode Detection Logic

GPT should infer mode from these signals:

| Signal | Mode |
|--------|------|
| "container", "20ft", "40ft", "FCL" | Sea |
| "kg", "weight", "CBM", "volume" | Air |
| "ocean", "sea freight" | Sea |
| "air", "air freight", "cargo" | Air |
| Airport codes (SFO, FRA, NRT) | Air |
| Port names only | Sea |

If unclear → set `mode: null` and add to `missing_fields`.

---

## Multi-Route Handling

Email 06 example has multiple routes in one request:
```
Rates from Busan, South Korea to:
1. Hamburg - 2 x 40ft
2. Rotterdam - 1 x 20ft
```

GPT must return multiple shipment objects in the `shipments` array.
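
Under that rule, the expected extraction for email 06 would look roughly like this hand-written illustration of the schema (the sender address is a placeholder, since the real one is not shown above):

```python
# Hypothetical expected output for email 06: raw locations, one shipment
# object per route.
email_06_expected = {
    "sender_email": "sender@example.com",  # placeholder; real sender not shown here
    "shipments": [
        {"mode": "sea", "origin_raw": "Busan, South Korea",
         "destination_raw": "Hamburg", "container_size_ft": 40, "quantity": 2,
         "actual_weight_kg": None, "volume_cbm": None, "commodity": None},
        {"mode": "sea", "origin_raw": "Busan, South Korea",
         "destination_raw": "Rotterdam", "container_size_ft": 20, "quantity": 1,
         "actual_weight_kg": None, "volume_cbm": None, "commodity": None},
    ],
    "missing_fields": [],
    "needs_clarification": False,
}
```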

---

## Missing Information Detection

If any of these are missing, add to `missing_fields`:

**Sea freight requires:**
- origin (specific city/port, not just "China")
- destination (specific city/port)
- container_size_ft (20 or 40)
- quantity

**Air freight requires:**
- origin
- destination
- actual_weight_kg
- volume_cbm

**Email 03 example** ("ship from China to Poland"):
```python
{
    "sender_email": "anna.kowalski@eurotrade.pl",
    "shipments": [{
        "mode": null,
        "origin_raw": "China",        # Too vague!
        "destination_raw": "Poland",  # Too vague!
        ...
    }],
    "missing_fields": ["origin city", "destination city", "mode", "container size", "quantity"],
    "needs_clarification": true
}
```
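
The required-field rules above are deterministic, so they can be sketched without GPT. Field names here assume the extraction schema from earlier; the "too vague" judgment itself (e.g. "China" as an origin) is still left to GPT:

```python
SEA_REQUIRED = ["origin_raw", "destination_raw", "container_size_ft", "quantity"]
AIR_REQUIRED = ["origin_raw", "destination_raw", "actual_weight_kg", "volume_cbm"]

def missing_fields_for(shipment: dict) -> list[str]:
    """Return the required fields that are absent for this shipment's mode."""
    mode = shipment.get("mode")
    if mode == "sea":
        required = SEA_REQUIRED
    elif mode == "air":
        required = AIR_REQUIRED
    else:
        # Mode unknown: only the mode and location fields can be checked.
        required = ["mode", "origin_raw", "destination_raw"]
    return [f for f in required if shipment.get(f) is None]
```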

---

## Implementation Plan

1. **models.py** - Define dataclasses: `Email`, `Shipment`, `ExtractionResult`
2. **extraction.py** - GPT extraction function:
- Load email JSON
- Build prompt with schema
- Call OpenAI API with structured output
- Parse response into dataclasses
3. **test_extraction.py** - Test against all 10 emails, compare to expected outputs

---

## GPT Prompt Structure

```
System: You are a freight quote extraction assistant. Extract shipping request details from emails.

User: Extract shipment details from this email:
From: {sender}
Subject: {subject}
Body: {body}

Return JSON matching this schema: {schema}

Rules:
- Extract ALL routes if multiple are mentioned
- Keep location names exactly as written (no normalization)
- Infer mode from context (container=sea, kg/CBM=air)
- Set needs_clarification=true if origin/destination are too vague (just country names)
```
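
The prompt above can be assembled deterministically. A sketch of a hypothetical `build_messages` helper (the actual OpenAI API call and the schema string are out of scope here):

```python
SYSTEM_PROMPT = (
    "You are a freight quote extraction assistant. "
    "Extract shipping request details from emails."
)

def build_messages(sender: str, subject: str, body: str, schema: str) -> list[dict]:
    """Assemble the chat messages for the extraction call."""
    user = (
        f"Extract shipment details from this email:\n"
        f"From: {sender}\nSubject: {subject}\nBody: {body}\n\n"
        f"Return JSON matching this schema: {schema}\n\n"
        "Rules:\n"
        "- Extract ALL routes if multiple are mentioned\n"
        "- Keep location names exactly as written (no normalization)\n"
        "- Infer mode from context (container=sea, kg/CBM=air)\n"
        "- Set needs_clarification=true if origin/destination are too vague"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
```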

---

## Success Criteria

- [ ] Correctly extracts all 10 hackathon emails
- [ ] Multi-route email (06) returns multiple shipments
- [ ] Incomplete email (03) sets `needs_clarification: true`
- [ ] Fuzzy locations kept raw: "HCMC (Saigon)" not normalized yet
- [ ] Mode correctly inferred for all emails

---

## Next Step

After extraction is built and tested, move to **Step 3: Customer Identification** (SOP lookup by sender email).
207 changes: 207 additions & 0 deletions hackathon 2/freight_agent/docs/enrichment_v2_design.md
# Enrichment v2: Batched + Tool Calling Design

## Overview

Refactored enrichment that:
1. Batches all Qontext queries (REST API, no GPT cost)
2. Single GPT call to parse ALL context
3. Uses tool calling for deterministic validation
4. GPT handles fuzzy matching (names, locations)

## Architecture

```
ExtractionResult
        │
        ▼
┌────────────────────────────────────────────────────────────┐
│ QONTEXT QUERIES (REST API - no GPT)                        │
│                                                            │
│ 1. Query: "Customer with domain @{domain}?"                │
│ 2. Query: "Rules for {customer}?"                          │
│ 3. Query: "Surcharges for {destination}?" (for each dest)  │
│                                                            │
│ All responses collected as strings                         │
└────────────────────────────────────────────────────────────┘
        │
        ▼
┌────────────────────────────────────────────────────────────┐
│ GPT CALL #2 (with tool calling)                            │
│                                                            │
│ Input:                                                     │
│   - All Qontext responses (combined)                       │
│   - Shipment details (from extraction)                     │
│                                                            │
│ GPT Tasks:                                                 │
│   1. Parse customer name from context                      │
│   2. Parse SOP rules into structured format                │
│   3. Parse surcharges per destination                      │
│   4. Normalize locations (HCMC = Saigon = Ho Chi Minh)     │
│   5. Call validate_shipment tool for each shipment         │
│                                                            │
│ Output:                                                    │
│   - customer_name                                          │
│   - customer_sop (structured)                              │
│   - enriched_shipments (with surcharges)                   │
│   - validation_errors                                      │
│   - validation_warnings                                    │
│   - is_valid                                               │
└────────────────────────────────────────────────────────────┘
        │
        ▼
EnrichedAndValidatedRequest
```

## Tool Definition

```python
VALIDATION_TOOL = {
    "type": "function",
    "function": {
        "name": "validate_shipment",
        "description": "Check if a shipment passes customer SOP restrictions. Call this for EACH shipment after parsing the SOP rules.",
        "parameters": {
            "type": "object",
            "properties": {
                "shipment_index": {
                    "type": "integer",
                    "description": "Index of the shipment (0-based)"
                },
                "shipment_mode": {
                    "type": "string",
                    "enum": ["sea", "air"],
                    "description": "The shipping mode requested"
                },
                "normalized_origin": {
                    "type": "string",
                    "description": "Origin normalized to standard name (e.g., 'HCMC' not 'Saigon', 'Ho Chi Minh City')"
                },
                "mode_restriction": {
                    "type": ["string", "null"],
                    "description": "Customer's mode restriction from SOP, or null if none"
                },
                "origin_restriction": {
                    "type": ["string", "null"],
                    "description": "Customer's origin restriction from SOP (normalized), or null if none"
                },
                "customer_name": {
                    "type": "string",
                    "description": "Customer name for error messages"
                }
            },
            "required": ["shipment_index", "shipment_mode", "normalized_origin", "mode_restriction", "origin_restriction", "customer_name"]
        }
    }
}
```

## Tool Implementation

```python
def validate_shipment(
    shipment_index: int,
    shipment_mode: str,
    normalized_origin: str,
    mode_restriction: str | None,
    origin_restriction: str | None,
    customer_name: str
) -> dict:
    """
    Deterministic validation - no fuzzy logic, just exact checks.
    GPT already normalized the values before calling.
    """
    errors = []

    # Check mode restriction
    if mode_restriction and shipment_mode != mode_restriction:
        errors.append({
            "error_type": "mode_restriction",
            "message": f"Per your account agreement, {customer_name} is set up for {mode_restriction} freight only.",
            "suggestion": f"Would you like a {mode_restriction} freight quote instead?"
        })

    # Check origin restriction
    if origin_restriction and normalized_origin.upper() != origin_restriction.upper():
        errors.append({
            "error_type": "origin_restriction",
            "message": f"Per your account agreement, {customer_name} shipments must originate from {origin_restriction}.",
            "suggestion": f"Would you like a quote from {origin_restriction} instead?"
        })

    return {
        "shipment_index": shipment_index,
        "is_valid": len(errors) == 0,
        "errors": errors
    }
```
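
On the calling side, each `validate_shipment` tool call in the model's response has to be routed to the Python implementation. A simplified sketch of that dispatch loop; the `call` shape (`name` plus a JSON `arguments` string) is a flattened stand-in for the real SDK's tool-call objects:

```python
import json

def run_validation_tools(tool_calls, registry) -> list[dict]:
    """Route each tool call to its Python implementation and collect results."""
    results = []
    for call in tool_calls:
        fn = registry[call["name"]]           # e.g. {"validate_shipment": validate_shipment}
        args = json.loads(call["arguments"])  # tool arguments arrive as a JSON string
        results.append(fn(**args))
    return results
```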

## GPT System Prompt

```
You are parsing freight customer data from a knowledge graph and validating shipments.

TASKS:
1. Parse the customer name from the context
2. Parse the SOP rules (discounts, margins, restrictions, output requirements)
3. Parse any surcharges that apply to destinations
4. For each shipment, normalize the origin location to a standard name:
   - "Ho Chi Minh City", "Saigon", "SGN" → "HCMC"
   - "Shanghai", "Pudong" → "Shanghai"
   - etc.
5. Call the validate_shipment tool for EACH shipment to check restrictions

IMPORTANT:
- Normalize locations BEFORE calling the validation tool
- The tool does exact string matching, so normalization is critical
- Call the tool once per shipment
```

## Output Schema

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EnrichedAndValidatedRequest:
    """Combined enrichment + validation result."""
    sender_email: str
    customer_name: str
    customer_sop: CustomerSOP
    shipments: tuple[EnrichedShipment, ...]

    # Validation results
    is_valid: bool
    validation_errors: tuple[ValidationError, ...] = ()
    validation_warnings: tuple[ValidationWarning, ...] = ()

    # Carried forward
    missing_fields: tuple[str, ...] = ()
    needs_clarification: bool = False
```

## Benefits

| Aspect | Before (3+ calls) | After (1 call + tools) |
|--------|-------------------|------------------------|
| GPT calls | 3+ | 1 |
| Location matching | Hardcoded | GPT (flexible) |
| Validation logic | GPT (might err) | Tool (deterministic) |
| Error messages | GPT (might vary) | Tool (consistent) |

## Flow Summary

```
Extraction (GPT #1)
    │
    ▼
Qontext queries (REST, free)
    │
    ▼
Enrichment + Validation (GPT #2 with tools)
    ├─► GPT parses context
    ├─► GPT normalizes locations
    ├─► GPT calls validate_shipment tool (per shipment)
    └─► GPT compiles final result
    │
    ▼
EnrichedAndValidatedRequest
```