From a05580d8f45d04a34d7cef04a792dc55d519d82b Mon Sep 17 00:00:00 2001 From: Dhiraj Bhat Date: Mon, 2 Mar 2026 07:27:00 -0800 Subject: [PATCH] Improve reference files: graduated challenge, story freshness, calibration proxy, transcript formats MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit challenge-protocol.md: - Graduated activation (Level 3 = Assumption Audit only, Level 4 = Assumption Audit + Blind Spot, Level 5 = all 5 lenses) instead of binary Level 5 only - Challenge Debt: unresolved challenges tracked across sessions and surfaced at the start of the next relevant session instead of silently dropped - Avoidance Confrontation now has distinct Level 3/4 behaviors, not just a single Level 5 mode story-mapping-engine.md: - Framing Variety Constraint replaces "no story twice" rule: a story may appear twice in a mapping if the angles are substantively different, tested via a 3-criteria Substantiation Test (different competency, different STAR section, different interviewer takeaway) - Story Freshness Score: 3-factor decay function (use frequency risk, experience age risk, Proven Performer offset) replacing raw use-count overuse check. Verdict: Fresh / Fresh (Proven) / Moderate / Stale calibration-engine.md: - Practice Calibration Proxy (Section 8): plateau detection without real interview outcomes — when same answer type practiced 5+ times and scores vary ≤0.3 across 3 consecutive sessions, flag as practice ceiling and prescribe behavioral change. New status: practice-calibrating. - Score Confidence Display (Section 9): scores carry calibration status context (Low/Medium/High confidence) so candidates don't treat estimates as ground truth - De-escalation Check: root causes are explicitly resolved when affected dimensions improve, preventing over-diagnosis drift - Explicit transition rule: practice-calibrating → calibrating when 3+ real outcomes arrive transcript-formats.md: - Added Fireflies.ai as 9th supported format with disambiguation rules separating it from Grain - Zoom Webinar distinction: Panelist: prefix handling - Format Fingerprint for ambiguous cases: 5-signal fingerprint + medium-confidence detection decision, gives candidate basis to override - Granola quality signal: AI summary cross-checked against raw transcript, flags possible truncation or hallucination decode.md: - Pathway-weighted batch triage: Referral / Warm intro / LinkedIn / Cold / Internal dimension added to ranking. A Long-Shot Stretch with a Referral ranks above an Investable Stretch cold application. - JD Honest Limitations surfaced in every Standard/Deep output as a named section (not just reference knowledge). Quick Scans include a one-line reminder. Candidates never walk away treating a decode as a complete picture. 
feedback.md: - Memory reliability tagging for Type D (Post-Session Memory): Same-day / 1-2 days / 3+ days tags applied to Question Bank entries - Guided elicitation prompts for thin memories: surface what's actually there rather than accepting vague recall as complete capture - Rejection Leverage defers to graduated Challenge Protocol levels instead of hardcoding Level 5 behavior Co-Authored-By: Claude Sonnet 4.6 --- references/calibration-engine.md | 87 ++++++++++++++-- references/challenge-protocol.md | 158 +++++++++++++++++++++-------- references/commands/decode.md | 113 ++++++++++++++------- references/commands/feedback.md | 41 ++++---- references/story-mapping-engine.md | 125 ++++++++++++++++++----- references/transcript-formats.md | 59 +++++++---- 6 files changed, 429 insertions(+), 154 deletions(-) diff --git a/references/calibration-engine.md b/references/calibration-engine.md index 2b66cd8..85a1e61 100644 --- a/references/calibration-engine.md +++ b/references/calibration-engine.md @@ -12,9 +12,10 @@ The Calibration State section in `coaching_state.md` tracks: ## Calibration State ### Calibration Status -- Current calibration: [uncalibrated / calibrating / calibrated / miscalibrated] +- Current calibration: [uncalibrated / practice-calibrating / calibrating / calibrated / miscalibrated] - Last calibration check: [date] - Data points available: [N] real interviews with outcomes +- Practice calibration status: [plateau detected / not yet / N/A] ### Scoring Drift Log | Date | Dimension | Direction | Evidence | Adjustment | @@ -27,12 +28,18 @@ The Calibration State section in `coaching_state.md` tracks: ### Unmeasured Factor Investigations | Date | Trigger | Hypothesis | Investigation | Finding | Action | + +### Score Confidence Display +- Confidence level: [Low / Medium / High] +- Basis: [practice-only / partial real interview data / calibrated] +- Last updated: [date] ``` **Status definitions:** -- **Uncalibrated**: Fewer than 3 real interview outcomes. Not enough data for calibration. The system operates normally but cannot verify its own accuracy. -- **Calibrating**: 3+ outcomes exist, first calibration check in progress. System is building its accuracy model. -- **Calibrated**: Calibration check completed, practice scores reasonably predict outcomes (correlation exists). System confidence is high. +- **Uncalibrated**: Fewer than 3 real interview outcomes AND no practice plateau detected. Not enough data for calibration. System operates normally but cannot verify its own accuracy. +- **Practice-calibrating**: A practice plateau has been detected (Section 8) — the system has identified a ceiling from practice data alone, without real interview outcomes. Internal calibration only; external validation still needed. +- **Calibrating**: 3+ real outcomes exist, first calibration check in progress. System is building its accuracy model. +- **Calibrated**: Calibration check completed, practice scores reasonably predict outcomes. System confidence is high. - **Miscalibrated**: Practice scores do not predict outcomes, or external feedback consistently contradicts coach scoring. System is actively investigating and adjusting. 
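+
+A minimal transition sketch for the statuses above, assuming a simple state record and the thresholds this file defines (3+ real outcomes; plateau detection per Section 8). The dataclass and function names are hypothetical, for illustration only:
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class CalibrationState:
+    status: str = "uncalibrated"    # one of the five statuses defined above
+    real_outcomes: int = 0          # outcomes recorded in the Outcome Log
+    plateau_detected: bool = False  # set by the Section 8 plateau check
+
+def next_status(state: CalibrationState, drift_check_passed: bool | None = None) -> str:
+    """Apply the transition rules in the order this file gives them."""
+    if state.real_outcomes >= 3:
+        # Real outcomes supersede practice data (Section 8 transition rule).
+        if drift_check_passed is None:
+            return "calibrating"  # first drift check still in progress
+        # Miscalibration can also be triggered by contradictory external feedback.
+        return "calibrated" if drift_check_passed else "miscalibrated"
+    if state.plateau_detected:
+        return "practice-calibrating"  # internal ceiling found, no external validation yet
+    return "uncalibrated"
+```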
--- @@ -58,7 +65,7 @@ For each of the 5 dimensions: Cross-reference external feedback with coach scores: - If recruiter feedback says "great storytelling, hard to follow" but the coach scored Structure at 4, that's a drift signal - If feedback says "polished but generic" but the coach scored Differentiation at 3+, that's a drift signal -- Weight external feedback higher than internal scoring — the interviewer is the ground truth +- **Weight external feedback higher than internal scoring — the interviewer is the ground truth.** This is non-negotiable. Internal scores are estimates; external feedback is a direct signal from the decision-maker. ### Step 4: Generate Drift Report For each dimension showing drift: @@ -70,7 +77,7 @@ For each dimension showing drift: ### Step 5: Present to Candidate Frame drift corrections as improved predictive accuracy, NOT as goalpost-moving: - "I've been scoring your Structure at 4, but interviewer feedback consistently suggests it's landing at 3. I'm recalibrating so my scores better predict how real interviewers will evaluate you. This isn't a downgrade — it's a more accurate picture." -- Record the adjustment in the Scoring Drift Log and Calibration Adjustments tables +- Record the adjustment in the Scoring Drift Log and Calibration Adjustments tables. --- @@ -95,15 +102,17 @@ Example: "Conflict avoidance" detected in Session 3 (affecting Substance on Q2 a - The pattern no longer appears in new analyses - Update status to "Resolved" with date and what worked +**De-escalation Check**: When all affected dimensions have improved 1+ points for 2+ consecutive sessions, check whether the root cause is still active — or whether it's been treated and should be resolved. Don't leave root causes as "Active" forever. An over-diagnosed root cause becomes noise. + --- ## Section 4: Temporal Decay for Intelligence Data -Apply these freshness rules when referencing intelligence data in prep, progress, or any workflow: +Apply these freshness rules when referencing intelligence data in prep, progress, or any workflow. "Intelligence data" includes: Interview Loops entries in `coaching_state.md` (question bank, company patterns, effective/ineffective patterns, recruiter/interviewer feedback). Story-specific freshness is tracked separately via the Story Freshness Score in `references/story-mapping-engine.md` Section 5 — the two systems run in parallel; neither substitutes for the other. | Data Type | Current | Historical | Archive | |---|---|---|---| -| **Question Bank entries** | < 3 months old | 3-6 months old — flag as "historical, may not reflect current process" | > 6 months old — archive only, don't weight in predictions | +| **Question Bank entries** | < 3 months old | 3–6 months old — flag as "historical, may not reflect current process" | > 6 months old — archive only, don't weight in predictions | | **Effective/Ineffective Patterns** | < 3 months old | 3+ months old — re-test before relying. "This pattern was confirmed 4 months ago. Let's check if it still holds." | N/A — patterns don't expire, but confidence decays | | **Company Patterns** | < 6 months old | > 6 months old — flag: "Company data is from [date]. Processes and culture may have changed." | > 12 months — treat as background context only | | **Recruiter/Interviewer Feedback** | Current loop (active interview process) = high-signal | Closed loops = context only, not actionable | N/A | @@ -215,12 +224,72 @@ Ask the candidate about the interviews where prediction failed: 3. 
Track results explicitly in the Unmeasured Factor Investigations table ### Step 4: Resolution -- **Confirmed**: The factor is real. Add it to Effective Patterns (or Ineffective Patterns). Integrate into coaching: "Before your next interview, do a 10-minute warmup — your energy level matters as much as your content." +- **Confirmed**: The factor is real. Add it to Effective Patterns (or Ineffective Patterns). Integrate into coaching. - **Inconclusive**: Not enough data. Continue tracking. - **Closed**: The hypothesis didn't hold. Remove from active investigation. --- +## Section 8: Practice Calibration Proxy + +**The problem**: Real interview outcomes require 3+ interviews to trigger calibration. But most candidates need calibration insights much earlier — and practice alone can reveal meaningful ceiling signals without waiting for external data. + +**What a plateau means**: When the same answer type is practiced 5+ times and scores stop improving across 3+ consecutive sessions on the same dimension, that's a practice ceiling — not a real-interview ceiling. It means the candidate has extracted all the improvement available from repetition alone. **The next point of improvement requires behavioral change, not more practice.** + +### Plateau Detection + +During `practice`, after scoring, check Score History for the current practice dimension: +- If the same answer type (same competency category) has been practiced 5+ times +- AND the dimension score has varied by ≤ 0.3 across the last 3 sessions +- → Flag as a practice plateau + +**How to surface it**: "Your [dimension] scores on [competency] answers have plateaued at ~[score] across the last [N] sessions. More practice on the same material won't move this. The ceiling isn't your skill — it's the approach. Here's what actually changes it: [specific behavioral intervention]." + +**What changes a plateau**: Practice repetition builds delivery fluency. Plateaus break when the underlying behavior changes — tension in stories that currently avoid conflict, specificity in answers that currently generalize, or earned secrets in answers that currently sound generic. Identify the specific behavioral lever, prescribe it, then re-practice. + +### Practice Calibration State Update + +When a plateau is detected: +1. Update Calibration Status to "Practice-calibrating" +2. Log in Calibration Adjustments: "Practice plateau detected on [dimension] for [competency] type answers — ceiling identified at ~[score] from practice data. Behavioral intervention prescribed: [intervention]." +3. After the behavioral intervention is attempted (next 2–3 sessions), re-check whether scores move. If they do: the intervention worked, plateau resolved. If they don't: the intervention was wrong — escalate to root cause tracking. + +**Transition to real-interview calibration**: Once 3+ real interview outcomes are recorded in the Outcome Log, immediately run Scoring Drift Detection (Section 2). Update Calibration Status from "Practice-calibrating" to "Calibrating." Real-interview outcomes supersede practice plateau data — the practice calibration informs the starting hypothesis, but external outcomes are the ground truth. Do not carry forward practice calibration adjustments as facts; carry them forward as hypotheses to be validated. + +**Honest boundary**: Practice calibration is internal only. It tells the candidate where practice stops being useful. It cannot replace real interview outcomes for verifying external scoring accuracy. Both are needed. 
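+
+The plateau check above reduces to a few lines. A minimal sketch, assuming one score per practice run for a single dimension of a single answer type; the function name and the thresholds-as-parameters are illustrative, not a prescribed implementation:
+
+```python
+def detect_plateau(scores: list[float], min_practices: int = 5,
+                   window: int = 3, max_variation: float = 0.3) -> bool:
+    """Section 8 rule: the answer type has been practiced 5+ times and the
+    dimension score has varied by at most 0.3 across the last 3 sessions."""
+    if len(scores) < max(min_practices, window):
+        return False
+    recent = scores[-window:]
+    return max(recent) - min(recent) <= max_variation
+
+# Five practice runs that flatten out: the last three vary by 0.1, so flag it.
+print(detect_plateau([2.8, 3.1, 3.4, 3.5, 3.4]))  # True
+```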
+ +--- + +## Section 9: Score Confidence Display Protocol + +Every score presented to the candidate should carry an implicit or explicit confidence signal based on calibration status. This prevents scores from being treated as ground truth when they're estimates. + +### Confidence Levels + +| Calibration Status | Score Confidence | How to Frame | +|---|---|---| +| Uncalibrated | Low | "This score is my best estimate — I don't yet have real interview data to verify it's accurate for you." | +| Practice-calibrating | Low–Medium | "This score reflects your practice ceiling. Real interview data may show it differently." | +| Calibrating | Medium | "This score is based on [N] real interview outcomes — confidence is building." | +| Calibrated | High | "This score has been validated against your actual interview outcomes." | +| Miscalibrated | Low | "I'm actively recalibrating — treat this score as directional, not precise." | + +### When to Surface Confidence Explicitly + +Surface it when: +- The candidate is making a major decision based on scores (e.g., "Am I ready to apply?") +- Scores are being used to benchmark against a threshold ("I need to hit 4.0 before I interview") +- A score hasn't moved in 3+ sessions (plateau territory) +- The score contradicts external feedback + +Don't surface it every session — it becomes noise. Surface it when it matters. + +### The Core Principle + +Scores are coaching tools, not grades. A 3.5 from an uncalibrated system means something different from a 3.5 from a calibrated system. Making the difference visible keeps the candidate from over-indexing on numbers and under-indexing on the behavioral insight behind them. + +--- + ## Integration with Story Mapping Engine - When scoring drift adjusts a dimension score, flag stories in the storybank whose strength ratings were heavily influenced by that dimension. They may need re-evaluation. Example: "Your Substance scores have been recalibrated down by 0.5. S003 and S007 were rated based on Substance evidence — consider re-scoring them." diff --git a/references/challenge-protocol.md b/references/challenge-protocol.md index 55168f8..64a16d6 100644 --- a/references/challenge-protocol.md +++ b/references/challenge-protocol.md @@ -1,20 +1,35 @@ -# Challenge Protocol (Directness Level 5) +# Challenge Protocol -A cross-cutting challenge framework that deepens coaching rigor at Level 5. Invoked by multiple commands — not a standalone command or mode. At Levels 1-4, this protocol is inactive and all coaching behavior remains unchanged. +A cross-cutting challenge framework that deepens coaching rigor at Levels 3, 4, and 5. Invoked by multiple commands — not a standalone command or mode. + +--- + +## Graduated Activation Levels + +Challenge depth scales with the candidate's chosen directness level. This is a departure from the binary (Level 5 only) model: challenge begins earlier, at reduced intensity, so it becomes familiar rather than jarring. + +| Level | Challenge Scope | What Activates | +|---|---|---| +| **1–2** | No challenge | All coaching behavior unchanged from base mode | +| **3** | Lens 1 only — Assumption Audit | Surfaces the assumptions the candidate is resting on | +| **4** | Lenses 1–2 — Assumption Audit + Blind Spot Scan | Adds what the candidate can't see from inside their own experience | +| **5** | All 5 Lenses | Full challenge protocol as designed below | + +**Design principle**: Level 3 and 4 challenges are delivered in a single sentence or two — a quick provocation, not a full analysis. 
Level 5 challenges are the full lens invocation as specified per command below. --- ## The Five Lenses -Every challenge invocation uses some or all of these lenses: +Every challenge invocation uses some or all of these lenses, depending on level: | Lens | Question | Interview Coaching Application | -|------|----------|-------------------------------| -| **Assumption Audit** | What must be true for this to work? | "This story assumes the interviewer values speed over thoroughness. What if they don't?" | -| **Blind Spot Scan** | What can't the candidate see? | "You've told this story 5 times in practice. You know every beat. An interviewer hearing it fresh doesn't have your context." | -| **Pre-Mortem** | Imagine this failed — why? | "It's 48 hours from now. You didn't advance. What went wrong?" | -| **Devil's Advocate** | The strongest case against | "If I were the hiring manager looking for reasons not to advance you, here's what I'd point to..." | -| **Strengthening Path** | How to make it airtight | "Add [specific detail] and the attack surface shrinks to near zero." | +|---|---|---| +| **1 — Assumption Audit** | What must be true for this to work? | "This story assumes the interviewer values speed over thoroughness. What if they don't?" | +| **2 — Blind Spot Scan** | What can't the candidate see? | "You've told this story 5 times in practice. You know every beat. An interviewer hearing it fresh doesn't have your context." | +| **3 — Pre-Mortem** | Imagine this failed — why? | "It's 48 hours from now. You didn't advance. What went wrong?" | +| **4 — Devil's Advocate** | The strongest case against | "If I were the hiring manager looking for reasons not to advance you, here's what I'd point to..." | +| **5 — Strengthening Path** | How to make it airtight | "Add [specific detail] and the attack surface shrinks to near zero." | --- @@ -22,81 +37,101 @@ Every challenge invocation uses some or all of these lenses: ### Story Red Team — `stories add` / `stories improve` -Run all 5 lenses against the story after it's been added or improved: +**Level 3**: After the story is added or improved, run Lens 1 only: +- **Assumption Audit**: What must be true for this story to land with the interviewer? Name the single strongest assumption. +**Level 4**: Lenses 1–2: +- **Assumption Audit** (as above) +- **Blind Spot Scan**: What's invisible to the candidate about their own story? What context do they take for granted that an interviewer won't have? + +**Level 5**: All 5 lenses: 1. **Assumption Audit**: What must be true for this story to land? What's the interviewer's implicit framework, and does this story fit it? 2. **Blind Spot Scan**: What's invisible to the candidate about their own story? What context do they take for granted that an interviewer won't have? 3. **Pre-Mortem**: How does this story fail in a real interview? Where does it lose the interviewer's attention, raise doubt, or feel thin? 4. **Devil's Advocate**: Where does a skeptical interviewer attack? What follow-up questions would expose weaknesses? 5. **Strengthening Path**: One specific change that makes the story airtight. Not a list — the single highest-leverage fix. -At Levels 1-4: Skip. The standard improve diagnostic is sufficient. +At Levels 1–2: Skip entirely. + +--- ### Transcript Challenge — `analyze` -Run Lenses 1-4 against the overall interview performance. Lens 5 feeds into the Priority Move. 
+**Level 3**: After scoring, add one sentence only: +- **Assumption Audit**: Surface the single strongest assumption the candidate's overall performance rested on. -- **Assumption Audit**: What 2-3 hidden assumptions did this interview rest on? (e.g., "Assumed the interviewer already understood my domain," "Assumed storytelling polish compensates for thin substance") +**Level 4**: Lenses 1–2 brief pass: +- **Assumption Audit**: 1-2 hidden assumptions the interview rested on. +- **Blind Spot Scan**: One thing the candidate can't see about their own performance from inside it. + +**Level 5**: Run Lenses 1–4 against the overall interview performance. Lens 5 feeds into Priority Move. +- **Assumption Audit**: What 2–3 hidden assumptions did this interview rest on? (e.g., "Assumed the interviewer already understood my domain," "Assumed storytelling polish compensates for thin substance") - **Blind Spot Scan**: What can't the candidate see about their own performance from inside it? - **Pre-Mortem**: If this interview doesn't result in advancement, why? Based on what actually happened, not hypotheticals. - **Devil's Advocate**: The strongest case for passing on this candidate, built from the transcript evidence. - Lens 5 (Strengthening Path) feeds directly into Priority Move — the single highest-leverage action for the next 72 hours. -At Levels 1-4: Skip. Standard analysis is sufficient. +At Levels 1–2: Skip entirely. + +--- ### Round Challenge — `practice` (rounds 3+) -Apply one lens per round, rotated: Assumption → Blind Spot → Pre-Mortem → Devil's Advocate → cycle back. Keep to 1-2 sentences. The goal is a quick, sharp provocation that pushes the candidate to think differently about what they just said — not a full analysis. +**Level 3**: After each round (from round 3 onward), add one Assumption Audit sentence — a single pointed question about what the candidate's answer assumed. Keep to 1 sentence. Rotate through stories: don't repeat the same assumption challenge twice. + +**Level 4**: Alternate between Lens 1 (Assumption) and Lens 2 (Blind Spot) per round, cycling: round 3 = Assumption, round 4 = Blind Spot, round 5 = Assumption, etc. Keep to 1–2 sentences per lens. + +**Level 5**: Apply one lens per round, rotated: Assumption → Blind Spot → Pre-Mortem → Devil's Advocate → cycle back. Keep to 1–2 sentences. The goal is a quick, sharp provocation that pushes the candidate to think differently — not a full analysis. The resolution for each round challenge comes from the standard round protocol's Next Round Adjustment (Step 9), which provides the concrete, actionable fix. -At Levels 1-4: Skip. Standard round feedback is sufficient. +At Levels 1–2: Skip entirely. + +--- ### Hard Truth — `progress` -Based on all accumulated data (Score History trends, storybank gaps, avoidance patterns from Coaching Notes, self-assessment deltas, outcome patterns), identify the single most important uncomfortable truth the candidate needs to hear. One paragraph. No softening. No "but here's the good news." Just the truth. +**Level 3**: Deliver the Hard Truth (the single most important uncomfortable insight from accumulated data), but with one brief empathetic frame before the hard statement: acknowledge the effort, then state the truth. Still no softening of the substance — just the frame. -This is the moment where the coach earns the candidate's trust by saying what no one else will. 
Draw from: -- Score History trends that haven't moved despite effort -- Storybank gaps the candidate keeps avoiding -- Avoidance patterns captured in Coaching Notes -- Self-assessment deltas that suggest the candidate can't see their own weaknesses -- Outcome patterns that point to a systemic issue +**Level 4**: Hard Truth with Assumption Audit framing: before stating the truth, name the assumption the candidate has been operating under that makes the truth hard to see. "You've been assuming [X]. Here's what the data actually shows." -Lens 5 (Strengthening Path) is delivered through the progress review's Top 2 Priorities section, which immediately follows the Hard Truth — the candidate is never left with just the diagnosis. +**Level 5**: Full Hard Truth with no softening, no frame, grounded in all accumulated data (Score History trends, storybank gaps, avoidance patterns, self-assessment deltas, outcome patterns). One paragraph. No "but here's the good news." Just the truth. -At Levels 1-4: Omit entirely. +At all levels: Lens 5 (Strengthening Path) is always delivered immediately after the Hard Truth through the progress review's Top 2 Priorities section. The candidate is never left with just the diagnosis. + +At Levels 1–2: Omit the Hard Truth section entirely. + +--- ### Pre-Mortem — `hype` -The honest counterweight to the confidence boost. After the 60-Second Hype Reel, before the Pre-Call 3x3: +**Level 3**: Skip. Hype stays pure boost at Level 3. -Present 2-3 most likely failure modes for this specific interview, each with a one-line prevention cue. Source from: -- Active Coaching Strategy bottleneck -- Storybank gaps for this company/role -- Self-assessment calibration tendency (over-rater may not self-correct in the moment) -- Avoidance patterns from Coaching Notes -- Previous rejection feedback from similar companies +**Level 4**: One failure mode only — the single most likely risk for this specific interview, with a one-line prevention cue. Keep it brief. End immediately with the release cue. -End with the release cue: "You know these risks. Now set them aside and go execute." The pre-mortem's purpose is to move failure anxiety from the subconscious (where it causes freeze) to the conscious (where it becomes actionable). Once acknowledged, let it go. +**Level 5**: Full pre-mortem. After the 60-Second Hype Reel, before the Pre-Call 3x3: +- Present 2–3 most likely failure modes for this specific interview, each with a one-line prevention cue. +- Source from: Active Coaching Strategy bottleneck, storybank gaps for this company/role, self-assessment calibration tendency, avoidance patterns, previous rejection feedback from similar companies. +- End with the release cue: "You know these risks. Now set them aside and go execute." The pre-mortem's purpose is to move failure anxiety from the subconscious to the conscious. Once acknowledged, let it go. -At Levels 1-4: Skip entirely. Hype stays pure boost. +At Levels 1–2: Skip entirely. + +--- ### Rejection Leverage — `feedback` (Type B rejection outcomes) -Don't lead with comfort. Lead with extraction: "What can we extract from this?" +**Level 3**: Skip. Standard emotional triage only — empathy first, learning extraction follows. -Run Lenses 1-3 retrospectively: +**Level 4**: After standard emotional triage, add Lens 1 retrospectively: "What assumptions were you making about this company/role/interview that turned out not to be true?" One question, one round of exploration. + +**Level 5**: Don't lead with comfort. 
Lead with extraction: "What can we extract from this?" +Run Lenses 1–3 retrospectively: 1. **Assumptions**: What assumptions were wrong about this company/role/interview? What did you believe going in that turned out not to be true? 2. **Blind Spots**: What does this rejection reveal that you couldn't see before? What pattern is now visible that wasn't? 3. **Pre-Mortem (retrospective)**: With hindsight, what was the pre-mortem you should have done? What failure modes were predictable? +Then: concrete adjustments for the next similar interview, pattern detection against Outcome Log, and close: "Rejection is data. This data says [specific insight]. Here's what we do with it." -Then: - Concrete adjustments for the next similar interview - Pattern detection: Does this match other rejections in the Outcome Log? - Close: "Rejection is data. This data says [specific insight]. Here's what we do with it." -At Levels 1-4: Standard emotional triage from the Psychological Readiness Module. Learning extraction follows empathy, not leads. +At Levels 1–2: Standard emotional triage from the Psychological Readiness Module. Learning extraction follows empathy, not leads. --- @@ -105,7 +140,6 @@ At Levels 1-4: Standard emotional triage from the Psychological Readiness Module ### Detection Signals Track these across sessions. Three or more instances of the same pattern constitute avoidance: - - Skips the same competency across multiple gap reviews - Chooses "safe" drill types, avoids pushback/stress drills - Changes subject when a specific weakness is raised @@ -118,12 +152,46 @@ Name it directly: "I've noticed you've steered away from [topic] three times now Stay in the discomfort. Don't offer an escape route. Don't pivot to something easier. The candidate chose Level 5 because they want to be pushed — honor that choice. Once the candidate engages with the discomfort, pivot into the avoided topic with a concrete drill or exercise. -### At Levels 1-4 +### At Level 4 + +Gentle naming: "I've noticed we tend to skip [topic]. I want to name that directly. Would you be willing to go there today?" Respect agency, but don't let the pattern slide. If the candidate declines, note it in Coaching Notes. + +### At Levels 1–3 Note the pattern in Coaching Notes. Raise gently during meta-checks: "I've noticed we tend to skip [topic]. Would it be useful to spend some time there?" Respect the candidate's pace. --- +## Challenge Debt + +A challenge is issued but the session ends before resolution — or the candidate deflects without resolution. This creates a **Challenge Debt**: an open challenge that shouldn't be dropped. + +### Recording Challenge Debt + +When a Level 4 or 5 challenge goes unresolved, record it in Coaching Notes: + +``` +Challenge Debt: [the challenge issued, in one sentence] — [session date] — pending +``` + +### Surfacing Challenge Debt + +At the start of the next relevant session (not every session — only when the same command or topic is invoked), surface it: + +"Last session I challenged [X]. I want to pick that back up before we go further — did you have a chance to sit with it?" + +If the candidate has resolved it: update the record to "resolved" with their insight. If they haven't: re-issue the challenge before proceeding. + +**Challenge Debt is not a punishment.** It's a tracking mechanism that prevents coaching from moving forward while important unresolved questions are pending. Candidates who ignore challenge debts are building on unstable ground. 
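+
+A minimal sketch of the debt record and the surfacing rule, assuming each debt is keyed to the command or topic that produced it (the `topic` field and the function names are hypothetical, for illustration only):
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class ChallengeDebt:
+    challenge: str           # the challenge issued, in one sentence
+    session_date: str        # when it was issued
+    topic: str               # command/topic it came from, e.g. "stories"
+    status: str = "pending"  # pending -> resolved (or expired, per Debt Expiry below)
+    insight: str = ""        # the candidate's resolution, once captured
+
+def debts_to_surface(debts: list[ChallengeDebt], session_topic: str) -> list[ChallengeDebt]:
+    """Surface open debts only when the same command or topic is invoked again,
+    not at the start of every session."""
+    return [d for d in debts if d.status == "pending" and d.topic == session_topic]
+
+def resolve_debt(debt: ChallengeDebt, insight: str) -> None:
+    debt.status = "resolved"
+    debt.insight = insight  # record what the candidate took from sitting with it
+```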
+ +### Debt Expiry + +Challenge Debt expires after 3 sessions without the relevant topic being revisited. At that point, close it: "We moved past [challenge] without resolution. I'm closing it, but note that [the original challenge] may resurface when you're in a real interview." Archive in Coaching Notes, don't re-raise. + +--- + ## Key Design Principle -**Challenge without resolution is cruelty.** Every challenge invocation ends with a concrete, actionable fix — either through Lens 5 directly (Story Red Team, Transcript Challenge → Priority Move, Pre-Mortem prevention cues, Rejection Leverage adjustments) or through an existing resolution mechanism (Round Challenge → Next Round Adjustment, Hard Truth → Top 2 Priorities). The coach attacks to strengthen, not to demoralize. The candidate should leave every challenge interaction knowing exactly what to do differently — not just feeling bad about what they did wrong. +**Challenge without resolution is cruelty.** Every challenge invocation — at any level — ends with a concrete, actionable fix. Lens 5 (Strengthening Path) provides the fix directly at Story Red Team, Transcript Challenge (→ Priority Move), and Rejection Leverage (→ Concrete adjustments). Existing resolution mechanisms carry the fix at Round Challenge (→ Next Round Adjustment), Hard Truth (→ Top 2 Priorities), and Pre-Mortem in `hype` (→ Prevention cues + release). No challenge ends without telling the candidate exactly what to do differently. + +**Graduated challenge is not a weakened challenge.** A Level 3 Assumption Audit delivered with precision is more valuable than a Level 5 spray of all lenses. The level calibrates intensity, not just lens count. One sentence that lands is worth more than five that overwhelm. diff --git a/references/commands/decode.md b/references/commands/decode.md index a7d625e..783e0f3 100644 --- a/references/commands/decode.md +++ b/references/commands/decode.md @@ -19,7 +19,7 @@ 5. **Between-the-lines signals** — Euphemisms decoded. "Fast-paced environment" = likely understaffed or rapid change. "Comfortable with ambiguity" = undefined role or early-stage. "Stakeholder management" = political environment. "Wear many hats" = no clear role boundaries. These are interpretations, not facts — confidence-label accordingly. 6. **What's missing** — Absence is information. A PM JD that doesn't mention data/analytics signals team maturity (or lack of PM influence on data). An engineering JD that doesn't mention testing may indicate culture. A senior role with no mention of mentoring signals an IC-heavy org. -**What JDs Don't Tell You** (honest limitations): +**What JDs Don't Tell You** (honest limitations — surfaced in every decode output, not just reference knowledge): - Actual day-to-day work (JDs describe the ideal, not the reality) - Team dynamics, manager quality, internal politics - Whether the role is backfill (someone left) vs. new headcount (growth) @@ -28,12 +28,15 @@ - Internal candidate preference - Hiring timeline urgency +These limitations appear in every Standard and Deep decode output as a "JD Honest Limitations" section. Quick Scans include a single-line reminder. The candidate should never walk away treating a decode as a complete picture. + **Batch Triage Research**: -- Average job search involves 100-200 applications. Hit rate is 2-8%. -- Time per application: 30-90 minutes (resume tailoring, cover letter, research). -- Improving triage accuracy by 20% saves 20-40 hours of application effort. 
+- Average job search involves 100–200 applications. Hit rate is 2–8%. +- Time per application: 30–90 minutes (resume tailoring, cover letter, research). +- Improving triage accuracy by 20% saves 20–40 hours of application effort. - Signal: candidates who apply to Strong Fit roles have materially better outcomes than those who spray-and-apply. - The batch comparison reveals the candidate's "sweet spot" — competencies that recur across JDs they're drawn to — which is itself career intelligence. +- **Pathway matters as much as fit.** A Long-Shot Stretch role with a strong referral often outperforms a Strong Fit cold application. Triage that ignores access is incomplete. ### Confidence Labeling System @@ -57,21 +60,22 @@ Every interpretation gets a confidence label. This is non-negotiable — it's wh ### Required Inputs - At least one JD (pasted text or key details) -- For batch triage: 2-5 JDs +- For batch triage: 2–5 JDs ### Optional Inputs - Depth level: Quick Scan / Standard / Deep Decode (default: Standard) - Specific questions about the JD - Target priority (is this a top target or early-stage exploration?) +- **Pathway** (for each JD in a batch): Do you have a referral, warm intro, or connection at this company? This affects triage ranking. ### Depth Levels | Level | When to Use | What It Covers | |---|---|---| -| **Quick Scan** | Fast filter, building a target list, evaluating whether to apply | Top 5 competencies extracted + fit verdict (if Profile exists) + 3 key signals + confidence labels + recruiter verification questions for LOW/UNKNOWN items | -| **Standard** | Default. Full JD analysis. | All 6 lenses + full competency extraction + fit assessment + concern mapping + question-prediction seeds + recruiter verification questions + teaching layer | -| **Deep Decode** | High-priority target, unusual JD, career transition, seniority mismatch | All of Standard + seniority calibration (what level this JD actually targets vs. what it says) + team maturity analysis + JD structural quality assessment + Challenge Protocol (Level 5) | +| **Quick Scan** | Fast filter, building a target list, evaluating whether to apply | Top 5 competencies extracted + fit verdict (if Profile exists) + 3 key signals + confidence labels + recruiter verification questions for LOW/UNKNOWN items + one-line JD limitations reminder | +| **Standard** | Default. Full JD analysis. | All 6 lenses + full competency extraction + fit assessment + concern mapping + question-prediction seeds + recruiter verification questions + JD Honest Limitations section + teaching layer | +| **Deep Decode** | High-priority target, unusual JD, career transition, seniority mismatch | All of Standard + seniority calibration + team maturity analysis + JD structural quality assessment + Challenge Protocol (Level 5) | ### Logic / Sequence (7 steps) @@ -81,6 +85,8 @@ Accept JD in any format — full posting, bullet-point paste, screenshot descrip **Step 2: Context Assembly** Pull from coaching_state.md: Profile (target role, seniority band), Resume Analysis (skills, experience, positioning strengths), Storybank (skills coverage), Positioning Statement (key differentiator), active Interview Loops (is this company already in progress?), JD Analyses (previous decodes for pattern comparison). +For batch triage, also note pathway information provided: which JDs have referrals, warm intros, connections, or cold access. + **Step 3: 6-Lens Decode** Apply all 6 decoding lenses. 
For each finding: - State the finding @@ -99,7 +105,7 @@ For Lens 5 (between-the-lines signals), use the "decode + verify" pattern: For Lens 6 (what's missing), explicitly state what the absence might mean AND what it might not mean. Absence is ambiguous — don't present a single interpretation as truth. **Step 4: Competency Extraction** -Extract top 5-7 competencies in priority order. For each: +Extract top 5–7 competencies in priority order. For each: - The competency (specific, not vague) - Source section(s) (where in the JD it appears) - Confidence level @@ -117,32 +123,47 @@ Produce fit verdict using the Role-Fit Assessment from cross-cutting.md (5 dimen Verdict: Strong Fit / Investable Stretch / Long-Shot Stretch / Weak Fit -For Long-Shot Stretch and Weak Fit: be honest. "This JD targets [X], and your profile shows [Y]. You could apply, but you should know the gap is structural, not just a framing issue." Don't waste the candidate's time encouraging applications to roles they're not competitive for — unless they have a specific strategy (referral, internal transfer, etc.). +For Long-Shot Stretch and Weak Fit: be honest. Don't waste the candidate's time encouraging applications to roles they're not competitive for — unless they have a specific strategy (referral, internal transfer, etc.). **Step 6: Recruiter Verification Questions** Compile all MEDIUM, LOW, and UNKNOWN confidence interpretations into a prioritized list of questions to ask the recruiter. For each: - What you want to verify - Why it matters for your decision/preparation -- How to ask it naturally (not "your JD said X, what does that mean?" but "Can you tell me more about [topic]?") +- How to ask it naturally Order by impact on the apply/don't-apply decision. **Step 7: Batch Triage (if multiple JDs provided)** -For each JD: run Steps 1-6 individually (can be abbreviated for Quick Scan depth). +For each JD: run Steps 1–6 individually (can be abbreviated for Quick Scan depth). + +Then produce comparative analysis with pathway-weighted ranking: + +**Pathway Classification** (assess for each JD before ranking): +- **Referral**: Someone at the company has agreed to refer you. Highest-access tier. +- **Warm intro**: A mutual connection can introduce you to someone at the company. Second-highest access. +- **LinkedIn connection**: You have a 1st-degree connection at the company who may respond to outreach. Moderate access. +- **Cold**: No connection. Standard application path only. +- **Internal**: You're already at the company (internal transfer or promotion). Different rules apply. -Then produce comparative analysis: -- **Rank by fit** (highest fit first, with evidence for each) +**Pathway-weighted ranking logic** (a minimal scoring sketch follows after this list): +- Base ranking: fit level (Strong Fit > Investable Stretch > Long-Shot Stretch > Weak Fit) +- Pathway boost: A role with a Referral gets a +1 tier boost in ranking. A role with a Warm intro gets a +0.5 tier boost. Cold applications get no boost. +- Example: A Long-Shot Stretch with a Referral ranks above an Investable Stretch with a Cold application if the access is genuine — because the referral changes the odds more than the raw fit gap. +- Flag the logic transparently: "This role is ranked #2 despite being Investable Stretch because you have a referral, which materially increases your odds relative to the Strong Fit cold application at #1." 
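+
+A minimal sketch of this ranking. The tier numbers and the LinkedIn boost are assumptions for illustration (the text gives +1 for Referral and +0.5 for Warm intro but no value for LinkedIn). Note that with these values a Referral exactly offsets one fit tier, and the text's own examples break such ties both ways, so the tie-break toward access below is one defensible choice, not a rule:
+
+```python
+FIT_TIER = {"Weak Fit": 0, "Long-Shot Stretch": 1, "Investable Stretch": 2, "Strong Fit": 3}
+PATHWAY_BOOST = {
+    "Referral": 1.0,    # +1 tier, per the ranking logic above
+    "Warm intro": 0.5,  # +0.5 tier
+    "LinkedIn": 0.25,   # assumed: "moderate access", no boost specified in the text
+    "Cold": 0.0,        # no boost
+}
+
+def triage_rank(jds: list[dict]) -> list[dict]:
+    """Order a batch of decoded JDs by pathway-weighted fit, highest first.
+    Each dict needs 'fit' and 'pathway' keys; Internal roles are set aside
+    because different rules apply to them."""
+    def key(jd: dict) -> tuple[float, float]:
+        boost = PATHWAY_BOOST[jd["pathway"]]
+        return (FIT_TIER[jd["fit"]] + boost, boost)  # secondary key: access
+    external = [jd for jd in jds if jd["pathway"] != "Internal"]
+    return sorted(external, key=key, reverse=True)
+```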
+ +Then produce: +- **Rank by pathway-weighted fit** (highest first, with evidence for each) - **Overlapping competencies** across JDs — this reveals the candidate's market-validated sweet spot -- **Divergent requirements** — where JDs pull in different directions, signaling scope decisions the candidate needs to make -- **Allocation recommendation**: which to pursue first, which to skip, which to research more. Honest about skips — "This one is a Long-Shot Stretch because [specific reason]. Unless you have an inside connection, your time is better spent on the others." -- **Target profile synthesis** — what these JDs collectively reveal about what the market wants from someone like the candidate +- **Divergent requirements** — where JDs pull in different directions +- **Allocation recommendation**: which to pursue first, which to skip, which to research more +- **Target profile synthesis** — what these JDs collectively reveal about what the market wants ### Teaching Layer -This is what differentiates decode from prep's JD parsing. Quick Scan includes a lighter version (a single "Decode Tip" — one teaching insight per scan, rotating through the 6 lenses). Standard and Deep include the full "Learn to Read This Yourself" section: -- **Pattern spotted**: A specific decoding insight from THIS JD that the candidate can apply to future JDs (e.g., "Notice how 'drive strategy' appeared in responsibilities but 'execute campaigns' appeared in requirements — the responsibilities describe the aspirational version of the role, the requirements describe the actual bar") -- **Trap to watch for**: A common misread THIS JD exemplifies (e.g., "This JD lists 12 'required' qualifications. No candidate has all 12. In practice, 6-8 match is strong. Don't self-screen out based on a wish list.") -- **Self-decode prompt**: A question the candidate should ask themselves when reading the NEXT JD (rotate through a set: "What verb does the first responsibility use?", "Count how many times 'collaboration' or 'cross-functional' appears", "What's conspicuously absent?") +Quick Scan includes a "Decode Tip" — one teaching insight per scan, rotating through the 6 lenses. Standard and Deep include the full "Learn to Read This Yourself" section: +- **Pattern spotted**: A specific decoding insight from THIS JD +- **Trap to watch for**: A common misread THIS JD exemplifies +- **Self-decode prompt**: A question the candidate should ask themselves when reading the NEXT JD ### Output Schemas @@ -169,8 +190,11 @@ This is what differentiates decode from prep's JD parsing. Quick Scan includes a 1. [Question — for the highest-impact LOW/UNKNOWN item] 2. [Question — second priority] +## JD Reminder +This decode reflects what the JD says — not what the job actually is. Key unknowns: day-to-day reality, team dynamics, whether this is backfill or new headcount, and real comp range. + ## Decode Tip -[One teaching insight from this JD — e.g., "This JD lists 14 'requirements' — that's a wish list, not a checklist. 8-10 matches is strong." Rotate through the 6 lenses across scans so the candidate builds decoding intuition over time.] +[One teaching insight from this JD] **Recommended next**: `decode` (Standard) for full analysis. **Alternatives**: `prep [company]`, `research [company]` ``` @@ -231,6 +255,14 @@ Based on top competencies and identified gaps: 2. [Question] 3. [Question] +## JD Honest Limitations +What this decode can't tell you — and what to do about it: +- **Day-to-day reality**: JDs describe the ideal, not the actual work. 
Ask the hiring manager: "Can you walk me through what a typical week looks like for this role right now?" +- **Team and manager quality**: Not in any JD. Ask to meet the team and ask: "What's the team's working style?" +- **Backfill vs. new headcount**: Signals different urgency and role clarity. Ask: "Is this a new role or one that was previously filled?" +- **Real comp range**: If not posted, ask: "Can you share the budgeted range for this role?" (especially in transparency-law jurisdictions). See the `comp` command for full pre-offer comp coaching and scripts. +- **Internal candidate preference**: Cannot be determined from the JD. Ask: "Are you also considering internal candidates for this role?" + ## Learn to Read This Yourself - **Pattern spotted**: [teaching point from this specific JD] - **Trap to watch for**: [common misread this JD exemplifies] @@ -247,22 +279,22 @@ Based on top competencies and identified gaps: ## Seniority Calibration - JD states: [what level language the JD uses] -- JD signals: [what level the scope, verbs, and requirements actually target — with evidence] +- JD signals: [what level the scope, verbs, and requirements actually target — with evidence and confidence] - Assessment: [aligned / JD over-titles / JD under-titles] — [Confidence] - Implication for candidate: [what this means for their application/positioning] ## Team Maturity Signals -[What the JD reveals about the team's stage: process maturity, tech stack age, org structure, growth rate — all with confidence labels. Multiple interpretations where ambiguous.] +[What the JD reveals about the team's stage — with confidence labels and multiple interpretations where ambiguous.] ## JD Structural Analysis - Quality: [well-written / boilerplate / kitchen-sink / internally contradictory] -- What quality reveals: [a well-written JD suggests a thoughtful hiring process; a kitchen-sink JD may indicate unclear role definition or committee-written requirements] +- What quality reveals: [what a well-written vs. poorly written JD signals about the hiring process] - Red flags: [if any — internal contradictions, impossible combinations, unrealistic requirements] ## Challenge (Level 5 only) -- **Assumptions this decode rests on**: [2-3 assumptions with what would change if they're wrong] +- **Assumptions this decode rests on**: [2–3 assumptions with what would change if they're wrong] - **Blind spots**: [what we fundamentally can't see from a JD alone] -- **Devil's advocate**: [strongest case that this decode is wrong — specific interpretations that could be off] +- **Devil's advocate**: [strongest case that this decode is wrong] - **Highest-leverage verification**: [the single question that would most change the analysis if answered] ## Priority Moves (ordered) @@ -277,29 +309,31 @@ Based on top competencies and identified gaps: ```markdown ## Batch Triage: [N] Roles -## Ranking -| Rank | Company | Role | Fit | Strongest Match | Biggest Gap | Recommendation | +## Pathway-Weighted Ranking +| Rank | Company | Role | Fit | Pathway | Pathway Boost | Recommendation | |---|---|---|---|---|---|---| -| 1 | ... | ... | Strong | [competency] | [gap or none] | Pursue | -| 2 | ... | ... | Investable Stretch | [competency] | [gap] | Pursue with positioning | -| 3 | ... | ... | Long-Shot Stretch | [competency] | [gap] | Skip unless referral | +| 1 | ... | ... | Strong | Referral | +1 tier | Pursue — prioritize immediately | +| 2 | ... | ... 
| Investable Stretch | Warm intro | +0.5 tier | Pursue — address gap [X] before applying | +| 3 | ... | ... | Strong | Cold | No boost | Pursue — strong fit, standard odds | +| 4 | ... | ... | Long-Shot Stretch | Referral | +1 tier | Pursue — referral offsets stretch gap | +| 5 | ... | ... | Long-Shot Stretch | Cold | No boost | Skip unless you develop a connection first | ## Your Sweet Spot [Competencies that appear across multiple JDs — this is what the market values in you. Ordered by frequency across JDs.] ## Divergence -[Requirements that pull in different directions — reveals a scope decision the candidate needs to make. E.g., "JDs 1 and 3 want hands-on IC work; JD 2 wants people management. These are different career paths."] +[Requirements that pull in different directions — reveals a scope decision the candidate needs to make.] ## Allocation Recommendation -- **Pursue first**: [which roles, why — highest fit + highest priority] +- **Pursue first**: [which roles, why — highest pathway-weighted rank] - **Pursue with targeted prep**: [which roles, what specific gap to address] -- **Research more before deciding**: [which roles, what's missing] -- **Skip or deprioritize**: [which roles, why — honest about structural gaps] +- **Build pathway before applying**: [roles where cold application is low-odds but access could change the picture — suggest `outreach` to develop a connection] +- **Skip or deprioritize**: [which roles, why — honest about structural gaps with no pathway offset] ## Your Market Profile -[What these JDs collectively reveal about what the market wants from someone like the candidate. This is career intelligence, not just job-search tactics.] +[What these JDs collectively reveal about what the market wants from someone like the candidate.] -**Recommended next**: `research [top-ranked company]` or `prep [top-ranked company]`. **Alternatives**: `decode` with more JDs, `resume` (JD-targeted optimization for top target) +**Recommended next**: `research [top-ranked company]` or `prep [top-ranked company]`. **Alternatives**: `decode` with more JDs, `outreach [company]` (to build pathway for cold-access roles worth pursuing) ``` ### Coaching State Integration @@ -311,9 +345,10 @@ Save to coaching_state.md (one entry per JD — multiple can exist): - Date: [date] - Depth: [Quick Scan / Standard / Deep Decode] - Fit verdict: [Strong Fit / Investable Stretch / Long-Shot Stretch / Weak Fit] +- Pathway: [Referral / Warm intro / LinkedIn connection / Cold / Internal] - Top competencies: [top 3 in priority order] - Frameable gaps: [list] - Structural gaps: [list] - Unverified assumptions: [count of LOW/UNKNOWN confidence items] -- Batch triage rank: [if part of batch — rank/total] +- Batch triage rank: [if part of batch — rank/total, pathway-weighted] ``` diff --git a/references/commands/feedback.md b/references/commands/feedback.md index faba94e..ca7e96c 100644 --- a/references/commands/feedback.md +++ b/references/commands/feedback.md @@ -51,26 +51,15 @@ Classify the candidate's input into one of five types. If ambiguous, ask: "Is th - Update Interview Loops → relevant company entry (Status, Rounds completed) - Update Interview Intelligence → Question Bank Outcome column for all questions from this company/round - If advanced with next-round details, update Interview Loops → Next round +- If the loop status change affects pipeline priority, sync `state/pipeline.md` per the Stage Sync Protocol (see COACH.md). 
Specific mappings: Interview Loops Status "Applied" → pipeline Stage "Applied"; "Interviewing" → "Interviewing"; "Closed" (rejection) → "Closed"; "Offer" → "Offer". Also update: Last Activity date to today, and Round count if advancing. Ground truth: if coaching_state.md and pipeline.md contradict, coaching_state.md is authoritative. **Output**: Brief confirmation of the update. If outcome data now meets the threshold for outcome-score correlation (3+ real interviews), mention it: "You now have enough real interview data for `progress` to show outcome patterns. Worth running when you're ready." -**Calibration trigger**: When the 3-outcome threshold is crossed, note that calibration is now possible: "With 3+ real interview outcomes, the system can now check whether practice scores are predicting real results. Run `progress` to see the calibration analysis." Update Calibration State → Calibration Status to "calibrating" if it was "uncalibrated." +**Calibration trigger**: When the 3-outcome threshold is crossed, note that calibration is now possible: "With 3+ real interview outcomes, the system can now check whether practice scores are predicting real results. Run `progress` to see the calibration analysis." Update Calibration State → Calibration Status to "calibrating" if it was "uncalibrated" or "practice-calibrating." -#### Rejection Leverage (Level 5 only) +#### Rejection Leverage -When the outcome is a rejection at Level 5, don't lead with comfort. Lead with extraction: "What can we extract from this?" - -Run Challenge Protocol Lenses 1-3 retrospectively: -1. **Assumptions**: What assumptions were wrong about this company/role/interview? What did you believe going in that turned out not to be true? -2. **Blind Spots**: What does this rejection reveal that you couldn't see before? What pattern is now visible? -3. **Pre-Mortem (retrospective)**: With hindsight, what was the pre-mortem you should have done? What failure modes were predictable? - -Then: -- **Concrete adjustments** for the next similar interview -- **Pattern detection**: Does this match other rejections in the Outcome Log? If so, name the pattern. -- **Close**: "Rejection is data. This data says [specific insight]. Here's what we do with it." - -At Levels 1-4: Standard emotional triage from the Psychological Readiness Module in `references/cross-cutting.md`. Learning extraction follows empathy, not leads. +When the outcome is a rejection, the Challenge Protocol applies based on the candidate's directness level. See `references/challenge-protocol.md` → Rejection Leverage section for the graduated approach (Level 3 = retrospective assumptions only; Level 4 = assumptions + blind spots; Level 5 = full 3-lens retrospective). At Levels 1–2: standard emotional triage from the Psychological Readiness Module in `references/cross-cutting.md`. --- @@ -99,20 +88,35 @@ At Levels 1-4: Standard emotional triage from the Psychological Readiness Module **Trigger**: Candidate remembers a question, story detail, interviewer behavior, or other interview data after the debrief or analysis session has closed. +**The core problem with Type D**: Memory degrades fast. A candidate who remembers a question 3 days after the interview has a partial and reconstructed memory — they may remember the topic but not the exact wording, or the general thrust but not the follow-up. 
Capturing thin memories is still valuable (the topic is real), but they must be tagged with their reliability so downstream workflows don't treat recalled-days-later data the same as same-day data. + **Capture process**: 1. Identify what type of information it is: - A question they forgot during debrief → route to Question Bank - A story detail or new story → route to Storybank (suggest `stories` for full development) - An interviewer signal they remembered → route to Interview Loops - A company culture observation → route to Company Patterns -2. Capture it in the appropriate section. + +2. **Before accepting thin input, use guided elicitation.** If the candidate's memory is vague ("I think they asked something about conflict?"), don't just capture "conflict question — vague." Try to sharpen it: + - "What part of the question do you remember most clearly — the topic, the way they phrased it, or what they seemed to be after?" + - "Did they give you any context before asking, or was it direct?" + - "What was your first instinct when you heard it — what did you think they were really testing?" + - "Did you feel like you answered it well, or was it one that caught you off guard?" + These prompts don't manufacture memory — they surface what's actually there. If nothing more comes out, capture what exists and tag accordingly. + +3. **Tag with memory reliability**: + - **Same day**: High reliability — candidate captured this within hours. Treat as close to first-hand data. + - **1–2 days later**: Medium reliability — topic and general intent are likely accurate; exact wording less so. Note: "Recalled 1–2 days post-interview — topic reliable, wording reconstructed." + - **3+ days later**: Low reliability — capture the topic and any details the candidate is confident about, but flag explicitly: "Recalled 3+ days post-interview — treat as directional. Do not weight exact wording in prep." + +4. For questions routed to the Question Bank, add the memory reliability tag to the entry. **State updates**: - Route to the appropriate section as identified above -- If it's a question, add to Interview Intelligence → Question Bank with score "recall-only" +- If it's a question, add to Interview Intelligence → Question Bank with score "recall-only" AND memory reliability tag (Same-day / 1–2 days / 3+ days) - If it changes a previous assessment meaningfully, flag it -**Output**: Brief confirmation of where the information was captured. If it changes something meaningful, say so. If it would benefit from further development, suggest the relevant command: "Captured that question. If you want to prep an answer for it, `practice` can drill you on it." +**Output**: Brief confirmation of where the information was captured AND the reliability tag. "Captured that question in the Question Bank — tagged as 'recalled 3+ days later, directional only.' It's useful for prep context, but treat the wording as approximate." If it would benefit from further development, suggest the relevant command. --- @@ -139,5 +143,6 @@ At Levels 1-4: Standard emotional triage from the Psychological Readiness Module - **Capture first, analyze later.** Feedback captures data; `analyze`, `progress`, and `prep` are where that data becomes actionable. Don't over-analyze in the moment — confirm what was captured and move on. - **Flexible output.** There's no fixed output schema for `feedback`. The confirmation adapts to the input type — sometimes it's one line, sometimes it's a paragraph. Match the weight of the output to the weight of the input. 
+- **Memory reliability is data.** A question recalled 3 days later is still worth capturing — the topic is real. But treating it with the same weight as a same-day recall misleads downstream prep. Tag everything Type D with reliability.
 - **Optional next step.** After capturing, suggest the most natural next command if one is relevant. Don't force it.
 - **Don't duplicate existing workflows.** If the candidate starts a full debrief during `feedback`, redirect: "This sounds like a full debrief — want to switch to `debrief` so we capture everything systematically?" Same for detailed corrections that become re-analysis — redirect to `analyze`.
diff --git a/references/story-mapping-engine.md b/references/story-mapping-engine.md
index 7e838a2..fccaf60 100644
--- a/references/story-mapping-engine.md
+++ b/references/story-mapping-engine.md
@@ -20,10 +20,10 @@ Replace bare `Q1 -> S###` with a 4-level fit classification:
 When evaluating a story-question match, weigh these factors in order:
-1. **Competency match** (highest weight) — Primary skill match > Secondary skill match > Reframe match. A story whose primary skill directly addresses the tested competency is always preferred.
+1. **Competency match** (highest weight) — Primary skill match > Secondary skill match > Reframe match. A story whose primary skill directly addresses the tested competency is preferred when the factors below are comparable.
-2. **Strength score** (high weight) — Stories rated 4-5 > 3 > 2. A strength-5 story with a secondary skill match may outperform a strength-3 story with a primary skill match.
+2. **Strength score** (high weight) — Stories rated 4–5 > 3 > 2. A strength-5 story with a secondary skill match may outperform a strength-3 story with a primary skill match.
 3. **Company/role alignment** (medium weight) — Does the story's domain match the target company? Is the earned secret relevant to what this company values? Stories from the same industry or with transferable context get a boost.
-4. **Freshness** (medium weight) — Has this story been used in prior rounds at this company? Has it been used 3+ times in the current job search? Fresh stories signal range.
-5. **Variety** (portfolio constraint) — Applied at the portfolio level, not per-question. Penalizes using the same story twice in one interview prep.
+4. **Freshness** (medium weight) — Story freshness score (see Section 5). Stale stories are deprioritized when alternatives exist. Proven performers (stories that contributed to past advancements) get a boost that partially offsets staleness.
+5. **Variety** (portfolio constraint) — Applied at the portfolio level, not per-question. See Section 2 for the framing variety constraint — the same story can appear twice in a mapping with demonstrably different angles.

 ### Mapping Output Format

@@ -31,6 +31,7 @@ For each question-story mapping, state:
 - **Fit level**: Strong Fit / Workable / Stretch / Gap
 - **Why**: One line explaining the match (e.g., "Primary skill (leadership) directly matches competency. Strength 4. Domain aligned (B2B SaaS).")
 - **Bridging guidance** (Workable/Stretch only): How to frame the story to better address the competency. (e.g., "Foreground the cross-functional coordination element — it's a secondary skill in this story but it's what the question is testing.")
+- **Freshness flag** (if applicable): Note if the story is flagged as Stale or over-rehearsed.

 ---

@@ -63,10 +64,26 @@ For each conflict:
 3. Cascade to the next-best story for the losing question.
 4. Flag significant downgrades: "Q4 was downgraded from S003 (Strong Fit) to S006 (Workable) due to conflict with Q3. Bridging guidance: [specific framing]."
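+A rough sketch of this assignment-and-cascade logic, under stated assumptions: `assign_stories`, `FIT_ORDER`, and the input shape are hypothetical, ties are broken by question order, and the framing-variety reuse allowance in Step 4 below is ignored for brevity:
+
+```python
+# Sketch: greedy assignment with conflict cascade.
+# candidates: question -> [(story_id, fit_level), ...], ranked best-first
+# by the weighting factors above. Not the engine's real data structure.
+
+FIT_ORDER = {"Strong Fit": 3, "Workable": 2, "Stretch": 1, "Gap": 0}
+
+def assign_stories(candidates):
+    assigned, used = {}, set()
+    # Questions whose best candidate has the higher fit win conflicts,
+    # so process them first; losers cascade to their next-best story.
+    for q in sorted(candidates, key=lambda q: -FIT_ORDER[candidates[q][0][1]]):
+        for story_id, fit in candidates[q]:
+            if story_id not in used:
+                assigned[q] = (story_id, fit)
+                used.add(story_id)
+                if (story_id, fit) != candidates[q][0]:
+                    print(f"{q}: downgraded to {story_id} ({fit}) due to conflict")
+                break
+        else:
+            assigned[q] = (None, "Gap")  # nothing unused left; gap-handling applies
+    return assigned
+
+# Example: S003 is best for both questions; Q3 (higher fit) wins, Q4 cascades to S006.
+assign_stories({
+    "Q3": [("S003", "Strong Fit"), ("S001", "Stretch")],
+    "Q4": [("S003", "Workable"), ("S006", "Workable")],
+})
+```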
-### Step 4: Apply Variety Constraint
-No story should appear more than once in the final mapping unless no alternative exists. If a story must be reused:
-- Explain why: "S003 is the only story addressing both leadership and prioritization competencies. No alternative exists for Q4."
-- Suggest framing variation: "For Q3, lead with the decision-making angle. For Q4, lead with the stakeholder management angle."
+### Step 4: Apply Framing Variety Constraint
+
+The same story may appear twice in the final mapping if — and only if — the framing angle for each appearance is demonstrably different, and both angles are substantive parts of the story (not a dominant thread vs. a minor subplot).
+
+**Why this replaces the "no story twice" rule**: A strong story with two genuine angles produces better outcomes than forcing a weaker story into one of the slots. One strong story told twice with different leads is more competitive than two mediocre stories. Variety for its own sake is a trap.
+
+**When to allow a story to appear twice**:
+- Specify the distinct angle for each: "For Q3, lead with the stakeholder alignment dimension. For Q7, lead with the technical constraint navigation." Both must be explicitly named and genuinely distinct.
+- Confirm both angles are substantive using the **Substantiation Test** (all 3 criteria must hold; see the sketch after this step):
+  1. **Different competency tested**: Does Q3's angle test a different core competency than Q7's angle? If both angles test "leadership" (even labeled differently), they're relabeling. If one tests "stakeholder management" and the other tests "technical constraint navigation," they're distinct.
+  2. **Different STAR section emphasized**: Do the two angles draw from materially different parts of the story? If both lead with the same Action section relabeled, that's cosmetic. If one foregrounds the Situation setup and the other foregrounds the Result or a different Action thread, they're distinct.
+  3. **Different interviewer takeaway**: Would an interviewer who heard both versions learn two materially different things about the candidate? If yes — substantive reuse. If no — relabeling.
+- Note the reuse in the mapping: "S003 appears twice — distinct angles. Q3: stakeholder angle. Q7: constraint navigation angle."
+
+**When not to allow reuse**:
+- If the angle difference is cosmetic ("for Q3, I'll say leadership; for Q7, I'll say influence" — these are the same thing relabeled)
+- If a Workable or better alternative exists for one of the two questions — use it instead
+- If the story was used in a prior round at this company (the freshness constraint overrides angle variety)
+
+**If a story must be reused with the same framing angle** (no alternatives exist at any fit level): explain why and flag the limitation: "S003 is the only story addressing both [competency A] and [competency B] at any viable fit level. Reusing with same angle — limited range signal."
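+A minimal sketch of the Substantiation Test as a boolean gate. The `Angle` record and its field names are hypothetical, and string inequality stands in for the coach's judgment on each criterion:
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class Angle:
+    competency: str     # core competency this framing tests
+    star_emphasis: str  # e.g. "Situation", "Action: constraint thread", "Result"
+    takeaway: str       # what the interviewer learns about the candidate
+
+def substantiation_test(a: Angle, b: Angle) -> bool:
+    """True only if the two framings are substantively different.
+    All three criteria must hold; failing any one means relabeling."""
+    return (
+        a.competency != b.competency            # criterion 1: different competency
+        and a.star_emphasis != b.star_emphasis  # criterion 2: different STAR section
+        and a.takeaway != b.takeaway            # criterion 3: different takeaway
+    )
+```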
 ### Step 5: Apply Freshness Constraint
 Check `coaching_state.md` → Interview Loops for stories used in prior rounds at this company.
@@ -74,9 +91,10 @@ Check `coaching_state.md` → Interview Loops for stories used in prior rounds a
 - Flag: "S003 was used in Round 1. Using it again in Round 2 signals limited range unless they specifically ask you to elaborate."

 ### Step 6: Apply Overuse Check
-Flag stories used 3+ times in the current job search (check Use Count in storybank).
-- 3 uses: "S007 has been used in 3 interviews. Consider rotating to a fresher story if alternatives exist."
-- 5+ uses: "S007 is heavily used (5 times). Interviewers in your network may have heard it. Prioritize alternatives."
+Cross-reference story freshness scores (Section 5) and use counts:
+- Stories flagged as "Stale" or carrying High use-frequency risk: deprioritize when alternatives exist. Don't eliminate — a Stale Strong Fit still outperforms a Fresh Gap.
+- Stories flagged as Proven Performers: boost priority. Past advancement contribution partially offsets staleness.
+- When a stale story is selected because no alternative exists: note it and recommend developing a new story for this competency.

 ### Step 7: Output Final Mapping
 Produce the final mapping with annotations (see Output Schema below).

@@ -94,7 +112,7 @@ When company culture signals prize differentiation (e.g., companies known for "b
 Example: S005 (Workable, strong earned secret) competes with S008 (Workable, no earned secret). Under the conditional boost, S005 is treated as Strong Fit equivalent.

 ### When Calibration Confirms
-If `coaching_state.md` → Calibration State shows that Differentiation correlates with advancement for this candidate, upgrade this from conditional to default: always prefer stories with stronger earned secrets.
+If `coaching_state.md` → Calibration State → Scoring Drift Log shows that Differentiation predicts advancement for this candidate (i.e., advancements correlate with higher Differentiation scores and rejections with lower), upgrade this from conditional to default: always prefer stories with stronger earned secrets.

 ---

@@ -109,6 +127,59 @@ Secondary skill matches are always Workable at best — never Strong Fit — bec

 ---

+## Section 5: Story Freshness Score
+
+**The problem**: Use count alone is a blunt instrument. A story used 6 times 8 months ago is functionally fresh. A story used 3 times in the last 2 weeks is overcooked. A story about a project from 6 years ago may feel dated regardless of use count. Freshness needs to account for all three dimensions.
+
+### Three Freshness Factors
+
+Assess each factor independently, then combine into a verdict:
+
+**Factor 1 — Use Frequency Risk**
+How often has this story been practiced or used in interviews, and how recently?
+- **Low risk**: Used 1–2 times total, OR last use was 3+ months ago regardless of count
+- **Medium risk**: Used 3–4 times, with at least one use in the last 2 months
+- **High risk**: Used 5+ times in the last 3 months — over-rehearsal territory. Delivery may feel mechanical.
+
+**Factor 2 — Recency of Underlying Experience**
+How old is the event the story describes?
+- **Low risk**: Experience is < 3 years old — well within "recent and relevant" territory for most roles
+- **Medium risk**: Experience is 3–5 years old — still usable, but may need updated framing or a bridge to current context
+- **High risk**: Experience is 5+ years old — can feel dated. Exception: career-defining or uniquely powerful stories that have no current equivalent. These should be flagged but not automatically deprioritized.
+
+**Factor 3 — Proven Performer Status**
+Has this story contributed to real interview advancement?
+- **Yes**: Partially offsets risk on the other factors (the verdict is upgraded one level, capped at Fresh (Proven); see the table below). A story with a demonstrated track record of working is worth preserving even if it's been used frequently.
+- **No**: No offset. Standard freshness rules apply.
+
+### Freshness Verdict
+
+Combine Factors 1 and 2 into a base verdict, then apply the Proven Performer offset:
+
+| Factor 1 + Factor 2 Combination | Base Verdict | With Proven Performer Offset | Action |
+|---|---|---|---|
+| Both factors High risk | **Stale** | **Moderate** | Deprioritize when alternatives exist. Flag: "This story may feel mechanical or dated. Consider developing a replacement." |
+| 1 High risk factor OR 2 Medium risk factors | **Moderate** | **Fresh (Proven)** | Use with awareness. Flag the specific risk: "Use frequency is high — focus on natural delivery." or "Experience is aging — consider bridging to a more recent example if available." |
+| No High risk and at most 1 Medium | **Fresh** | **Fresh (Proven)** | No action needed. For a Proven Performer: use with confidence. "This story has a track record. Its use history is an asset, not a liability." |
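+A sketch of the verdict computation, to pin down the factor combinations and the one-level offset. The function name is hypothetical; the Low/Medium/High inputs are the Factor 1 and Factor 2 assessments above:
+
+```python
+def freshness_verdict(use_risk: str, age_risk: str, proven: bool) -> str:
+    """use_risk / age_risk: "Low" | "Medium" | "High" (Factors 1 and 2)."""
+    highs = [use_risk, age_risk].count("High")
+    mediums = [use_risk, age_risk].count("Medium")
+
+    if highs == 2:
+        base = "Stale"
+    elif highs == 1 or mediums == 2:
+        base = "Moderate"
+    else:
+        base = "Fresh"
+
+    if not proven:
+        return base
+    # Proven Performer offset: one-level upgrade, capped at Fresh (Proven).
+    return {"Stale": "Moderate", "Moderate": "Fresh (Proven)",
+            "Fresh": "Fresh (Proven)"}[base]
+
+# e.g. a heavily used story (High) about a 4-year-old project (Medium)
+# that advanced a past loop: freshness_verdict("High", "Medium", True)
+# returns "Fresh (Proven)".
+```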
Flag: "This story may feel mechanical or dated. Consider developing a replacement." | +| 1 High risk factor OR 2 Medium risk factors | **Moderate** | Use with awareness. Flag the specific risk: "Use frequency is high — focus on natural delivery." or "Experience is aging — consider bridging to a more recent example if available." | +| All Low risk (with or without Proven Performer) | **Fresh** | No action needed. | +| Any combo with Proven Performer offset | **Fresh (Proven)** | Use with confidence. "This story has a track record. Its use history is an asset, not a liability." | + +### Recording in Storybank + +Add a `Freshness` field to each storybank entry, updated after each use: + +``` +Freshness: [Fresh / Fresh (Proven) / Moderate / Stale] +Last assessed: [date] +``` + +Freshness should be re-assessed at the start of each prep cycle, not just when a story is selected. A story that was Moderate 3 months ago is likely Fresh again now. + +### When Staleness is Acceptable + +- **No alternative exists**: A Stale Strong Fit outperforms a Fresh Gap every time. Use the stale story, note the risk, and recommend developing a replacement. +- **Proven Performer**: If the story worked in past interviews, its familiarity is partially an asset — the candidate tells it with confidence. Monitor for mechanical delivery, not just use count. +- **Career-defining stories**: Some stories are irreplaceable — a founding moment, a breakthrough that defines the candidate's narrative identity. These should not be rotated out just because they're old. Instead, help the candidate keep them current with updated framing. + +--- + ## Output Schema Use this schema in `prep` output to replace the current simple story mapping: @@ -117,22 +188,22 @@ Use this schema in `prep` output to replace the current simple story mapping: ## Story Mapping ### Mapping Matrix -| Question | Primary Story | Fit | Backup Story | Fit | Notes | -|----------|--------------|-----|--------------|-----|-------| -| Q1: [question summary] | S### — [title] | Strong Fit | S### — [title] | Workable | | -| Q2: [question summary] | S### — [title] | Workable | S### — [title] | Stretch | Bridging: [guidance] | -| Q3: [question summary] | Gap | — | S### — [title] | Stretch | Gap-handling: Pattern 2 | -... +| Question | Primary Story | Fit | Freshness | Backup Story | Fit | Notes | +|----------|--------------|-----|-----------|--------------|-----|-------| +| Q1: [question summary] | S### — [title] | Strong Fit | Fresh | S### — [title] | Workable | | +| Q2: [question summary] | S### — [title] | Workable | Moderate | S### — [title] | Stretch | Bridging: [guidance] | +| Q3: [question summary] | Gap | — | — | S### — [title] | Stretch | Gap-handling: Pattern 2 | +| Q4: [question summary] | S003 — [title] | Strong Fit | Stale | S### — [title] | Workable | Reusing S003 — no Workable+ alternative. Consider developing replacement. 
| ### Portfolio Health - Unique stories used: [N] of [M] mapped questions - Conflicts resolved: [e.g., "Q3 and Q4 competed for S003 — assigned to Q3 (higher fit), Q4 uses S006"] -- Strength warnings: [stories rated <3 that appear in mapping — specific guidance for each] -- Freshness warnings: [stories used in prior rounds at this company] -- Overuse warnings: [stories used 3+ times in current search] +- Reuse with different angle: [if any — e.g., "S003 used twice: Q1 stakeholder angle, Q4 constraint navigation angle"] +- Freshness warnings: [stories flagged Stale or Moderate — specific guidance for each] +- Prior round conflicts: [stories used in earlier rounds at this company] ### Gaps -- [Competency]: best available is [story] ([fit level]). Gap-handling: [Pattern 1-4]. Consider developing a new story for this competency. +- [Competency]: best available is [story] ([fit level]). Gap-handling: [Pattern 1–4]. Consider developing a new story for this competency. ### Strength Warnings - [Question] -> [Story]: rated strength [N]. [Specific guidance — e.g., "This story needs quantified impact before deployment. Run `stories improve S###` to strengthen it."] @@ -142,18 +213,20 @@ Use this schema in `prep` output to replace the current simple story mapping: ## Integration Points -### With Calibration Engine (references/calibration-engine.md) +### With Calibration Engine (`references/calibration-engine.md`) - When scoring drift adjusts a dimension, flag stories whose strength ratings were driven by that dimension for re-evaluation. - When calibration shows Differentiation predicts advancement, upgrade earned-secret-aware selection from conditional to default. - When calibration links a specific dimension to rejections, elevate story mapping gaps in that dimension's competencies to "Calibration-Urgent" priority. -### With Prep (references/commands/prep.md) +### With Prep (`references/commands/prep.md`) - Prep Step 7 runs a storybank health check before mapping. - Prep Step 8 invokes this engine for the full mapping protocol. -### With Stories (references/commands/stories.md) +### With Stories (`references/commands/stories.md`) - Gap analysis in `stories find gaps` uses the fit scoring system to classify gaps. - Secondary skills are checked for coverage before declaring a competency a true gap. +- During `stories find gaps`, the Gap classification triggers guided story development — the memory excavation protocol for that specific competency (see `stories` command, sub-option: Find gaps). -### With Progress (references/commands/progress.md) -- Storybank health metrics include overuse tracking and freshness risk. +### With Progress (`references/commands/progress.md`) +- Storybank health metrics include overuse tracking, freshness risk, and proven performer status. +- Story freshness scores are re-assessed at the start of each progress review cycle. diff --git a/references/transcript-formats.md b/references/transcript-formats.md index 838f74d..ba57638 100644 --- a/references/transcript-formats.md +++ b/references/transcript-formats.md @@ -2,14 +2,14 @@ ## Purpose -Step 0.5 in the transcript processing pipeline. Runs before cleaning (Step 1). Detects the source tool/format of a raw transcript and normalizes it to a standard internal representation. +Runs before transcript cleaning in the processing pipeline (see `references/transcript-processing.md` for full sequence). Detects the source tool/format of a raw transcript and normalizes it to a standard internal representation. 
When invoking from `analyze`, this step runs first — before any content-level cleaning begins. ## Format Detection Protocol -- Examine the first 30-50 lines for format signals +- Examine the first 30–50 lines for format signals - Check for: VTT headers (`WEBVTT`), timestamp styles (HH:MM:SS vs MM:SS vs inline), speaker label patterns (Name:, Speaker 1:, etc.), topic/chapter headers, paragraph grouping, line numbering, metadata blocks -- Support 8 formats (detailed below) -- When detection is uncertain, default to Manual/generic processing and note: "Format not confidently detected — processing as generic transcript. Results may require manual review." +- Support 9 formats (detailed below) +- When detection is uncertain, attempt disambiguation rules before defaulting. If disambiguation still fails, default to Manual/generic processing and note: "Format not confidently detected — processing as generic transcript. Results may require manual review." ## Supported Formats @@ -18,7 +18,6 @@ Step 0.5 in the transcript processing pipeline. Runs before cleaning (Step 1). D **Detection signals**: Speaker names on their own line followed by paragraph-grouped text. Timestamps at paragraph level (not per-line). Often includes "Transcribed by Otter" footer. **Normalization rules**: - - Merge paragraph-grouped text under the same speaker into a single turn - Watch for mid-turn misattribution: Otter sometimes reassigns speaker mid-paragraph when background noise triggers speaker detection. If a short (< 10 word) segment attributed to a different speaker appears mid-paragraph, keep it with the surrounding speaker and flag: "[possible misattribution]" - Strip paragraph-level timestamps @@ -29,17 +28,15 @@ Step 0.5 in the transcript processing pipeline. Runs before cleaning (Step 1). D **Detection signals**: Topic/chapter headers (often bold or with section markers). Speaker labels with timestamps. Structured sections with headings. **Normalization rules**: - - Preserve topic headers as `[Topic: ...]` markers — these carry valuable structural information for format-aware parsing - Don't split Q&A exchanges at topic boundaries — if a question starts in one topic section and the answer continues in the next, keep them together - Merge consecutive lines from the same speaker ### 3. Google Meet -**Detection signals**: Per-line speaker labels with timestamps. Very aggressive line splitting — same speaker may have 5-10 consecutive short lines. Format: "Speaker Name HH:MM:SS" or similar. +**Detection signals**: Per-line speaker labels with timestamps. Very aggressive line splitting — same speaker may have 5–10 consecutive short lines. Format: "Speaker Name HH:MM:SS" or similar. **Normalization rules**: - - Merge aggressive per-line splitting for the same speaker — consecutive lines from the same speaker become one turn - Strip per-line timestamps - Reconstruct sentence-level text from fragments @@ -49,30 +46,31 @@ Step 0.5 in the transcript processing pipeline. Runs before cleaning (Step 1). D **Detection signals**: `WEBVTT` header on first line. Sequential numbered cues. Timestamps in `HH:MM:SS.mmm --> HH:MM:SS.mmm` format. May include positioning metadata (`align:`, `position:`). 
**Normalization rules**: - - Strip the WEBVTT header and all metadata lines - Strip cue numbering and timestamp lines - Extract speaker labels from text content if present (often formatted as "Speaker Name: text" within the cue) - Handle positioning metadata by ignoring it entirely - Merge consecutive cues from the same speaker +**Zoom Webinar distinction**: Zoom Webinar transcripts use the same VTT format but may have different speaker label conventions (panelists vs. attendees labeled differently). If speaker labels include "Panelist:" prefixes, note: "Zoom Webinar format detected — panelist speakers labeled. Attendee questions may appear as unlabeled or host-relayed." Apply standard VTT normalization, preserve panelist labels. + ### 5. Granola **Detection signals**: AI-generated meeting notes format with structured sections (Summary, Key Points, Action Items, Transcript). Speaker labels with timestamps. May include AI-generated summaries above the raw transcript. **Normalization rules**: - - Locate the raw transcript section (often after AI-generated notes) - Strip AI-generated summary/notes sections but note their existence: "[AI notes section detected — using raw transcript below]" - Merge same-speaker consecutive lines - Preserve action items as `[Action Item: ...]` markers if they appear inline +**Granola quality signal**: Granola's AI summary section can be used as a quality check. If the AI summary mentions a topic that doesn't appear anywhere in the raw transcript section, flag: "[Granola summary references topic not found in raw transcript — possible transcript truncation or AI hallucination in notes]." Don't use the AI summary as a substitute for the raw transcript in analysis. + ### 6. Microsoft Teams **Detection signals**: Speaker labels on separate lines from text content. Timestamps in their own column or line. May include echo artifacts (same text appearing twice). Format often resembles a table or structured block. **Normalization rules**: - - Combine separate speaker-label lines with their associated text blocks - Handle echo artifacts: if the same text appears twice within 2 lines attributed to the same speaker, deduplicate - Strip timestamp columns/lines @@ -83,17 +81,28 @@ Step 0.5 in the transcript processing pipeline. Runs before cleaning (Step 1). D **Detection signals**: Line numbers (1, 2, 3...) at the start of lines. Timestamps alongside speaker labels. Clean formatting with consistent structure. **Normalization rules**: - - Strip line numbers from the start of each line - Strip timestamps - Merge consecutive same-speaker lines -### 8. Manual/Generic +### 8. Fireflies.ai -**Detection signals**: None of the above patterns match, OR the format is ambiguous. Includes manually typed transcripts, notes, or unstructured text. +**Detection signals**: Speaker labels with timestamps, formatted as "[HH:MM] Speaker Name: text". Often includes a summary header block (Meeting Title, Date, Participants). May include automated action items and chapter markers. 
**Normalization rules**: +- Strip the summary header block entirely (Meeting Title, Date, Participants metadata) — note: "[Fireflies.ai header detected and stripped]" +- Strip timestamps from each turn +- Preserve chapter/topic markers as `[Topic: ...]` markers (same as Grain) +- Strip automated action items from the body of the transcript; preserve as `[Action Item: ...]` markers only if they appear inline within a speaker's turn +- Merge consecutive same-speaker turns + +**Distinguishing from Grain**: Both have topic headers. Fireflies uses `[HH:MM] Name: text` per-line format; Grain uses paragraph grouping under speaker with topic headers above. If the format is per-line with timestamps, treat as Fireflies. +### 9. Manual/Generic + +**Detection signals**: None of the above patterns match, OR the format is ambiguous. Includes manually typed transcripts, notes, or unstructured text. + +**Normalization rules**: - Infer speakers from context clues: Q:/A: labels, quotation marks, paragraph breaks, "Interviewer:"/"Candidate:" labels, indentation patterns - If no speaker labels can be inferred, ask the candidate: "I can't identify speaker turns in this transcript. Can you indicate which parts are your answers and which are the interviewer's questions?" - Preserve paragraph structure as turn boundaries when no other signals exist @@ -105,6 +114,8 @@ When format signals overlap: - **Otter vs. Google Meet**: Otter groups text into paragraphs under a speaker; Google Meet has one short line per timestamp per speaker. If average line length > 50 words, likely Otter. If < 15 words, likely Google Meet. - **Grain vs. Otter**: Grain has explicit topic/chapter headers that structure the transcript. If topic headers exist, treat as Grain. If only speaker labels and paragraphs, treat as Otter. - **Teams vs. Otter**: Teams separates speaker labels from text blocks (label on one line, text on the next). Otter puts the speaker label inline or at the top of a paragraph group. If speaker labels are consistently isolated on their own lines, treat as Teams. +- **Fireflies vs. Grain**: Both have topic headers. Fireflies is per-line with `[HH:MM] Name:` pattern; Grain groups paragraphs under speaker names. If timestamp + speaker label appears on every content line, treat as Fireflies. +- **Granola vs. Grain**: Both may have AI-generated structure. Granola has a distinct multi-section header (Summary, Key Points, Action Items) before the raw transcript. Grain goes directly into speaker-labeled content with topic markers. If a structured multi-section header is present, treat as Granola. ## Internal Representation (Normalization Target) @@ -117,7 +128,6 @@ All formats normalize to: ``` Rules: - - One speaker turn per block - Blank line between turns - Timestamps stripped entirely @@ -137,8 +147,23 @@ Rules: After normalization, report these quality signals to the Quality Gate: -- **Speaker label coverage**: What percentage of text blocks have identified speakers? (100% = high, <80% = flag) +- **Format identified**: Which of the 9 formats was detected, and at what confidence (High / Medium / Low / Defaulted to Manual). +- **Speaker label coverage**: What percentage of text blocks have identified speakers? (100% = high, < 80% = flag) - **Normalization confidence**: How confident is the format detection? 
(High = clear format match, Medium = some ambiguity, Low = defaulted to generic) - **Multi-speaker detection**: Number of distinct speakers detected and whether roles could be assigned - **Content preservation**: Estimated percentage of original content preserved through normalization (fidelity markers, topic headers, etc.) -- **Artifacts detected**: Any echo artifacts, misattributions, or garbled sections identified during normalization +- **Artifacts detected**: Any echo artifacts, misattributions, garbled sections, or AI-summary-vs-transcript discrepancies (Granola) identified during normalization + +## Format Fingerprint for Ambiguous Cases + +When disambiguation rules don't resolve the format, generate a **format fingerprint** before defaulting to Manual: + +1. Average line length (words per line) +2. Presence of timestamps (yes/no, and format if yes) +3. Speaker label position (inline / own line / absent) +4. Section/chapter headers present (yes/no) +5. Header block before transcript body (yes/no) + +Report the fingerprint alongside the detection decision: "Format fingerprint: avg 12 words/line, per-line timestamps `[HH:MM]`, speaker labels inline, no section headers, no header block. Closest match: Fireflies.ai (medium confidence). Proceeding as Fireflies — review if output looks incorrect." + +This gives the candidate an explicit basis to override the detection if they know the actual source tool.
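+To make the five signals concrete, here is a rough sketch of the fingerprint computation. `format_fingerprint` is a hypothetical helper; the regexes and header heuristics are illustrative assumptions, not the detector's actual rules:
+
+```python
+import re
+
+TS_INLINE = re.compile(r"\[\d{1,2}:\d{2}(?::\d{2})?\]")  # e.g. "[HH:MM]" prefixes
+TS_VTT = re.compile(r"\d{2}:\d{2}:\d{2}\.\d{3} -->")     # VTT cue ranges
+SPEAKER_INLINE = re.compile(r"^\S[^:]{0,40}: \S")        # "Name: text"
+SPEAKER_OWN_LINE = re.compile(r"^\S[^:]{0,40}:\s*$")     # "Name:" alone on a line
+
+def format_fingerprint(raw: str, sample: int = 50) -> dict:
+    """Compute the 5-signal fingerprint over the first `sample` non-empty lines."""
+    lines = [l for l in raw.splitlines() if l.strip()][:sample]
+    words = [len(l.split()) for l in lines] or [0]
+
+    if any(SPEAKER_INLINE.match(l) for l in lines):
+        label_pos = "inline"
+    elif any(SPEAKER_OWN_LINE.match(l) for l in lines):
+        label_pos = "own line"
+    else:
+        label_pos = "absent"
+
+    return {
+        "avg_words_per_line": round(sum(words) / len(words), 1),
+        "timestamps": any(TS_INLINE.search(l) or TS_VTT.search(l) for l in lines),
+        "speaker_label_position": label_pos,
+        # crude header heuristics; a real pass would reuse the per-format signals
+        "section_headers": any(l.startswith(("#", "Topic:", "[Topic")) for l in lines),
+        "header_block": bool(lines) and any(
+            lines[0].startswith(k) for k in ("Meeting Title", "WEBVTT", "Summary")
+        ),
+    }
+```
+
+Matching the resulting dict against per-format expectations (Fireflies: inline labels plus `[HH:MM]` timestamps; Google Meet: short lines; Otter: long paragraph-grouped lines) is what would justify a medium-confidence call like the example above.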