
# 🧪 Testing Guide

This document describes how to verify the Havoc Hackathon skill works correctly.

Since this is a conversational AI skill rather than traditional code, testing is done through **conversation playbooks**: scripted interactions that verify expected behavior.


## 🎮 How to Test Locally

1. Register the skill in a Copilot CLI session:

   ```
   /skills add ./
   ```

2. Run each playbook below and verify the expected behavior.

3. Check the QA checklist at the bottom before submitting a PR.


## 📋 Conversation Playbooks

### Playbook 1: Basic Code Hackathon

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `run hackathon - write a fizzbuzz function` | Opening ceremony with arena banner, contestants, rubric |
| 2 | (accept defaults or customize) | 3 models dispatched in parallel with progress commentary |
| 3 | (wait for completion) | All models finish, outputs normalized |
| 4 | (judging phase) | Sealed panel scores anonymized submissions |
| 5 | (results) | Drumroll → winner reveal → ASCII podium → ELO update |

### Playbook 2: Review Mode

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `run hackathon - review @src/app.js for security issues` | Detects review mode, adjusts rubric |
| 2 | (models complete) | Each model produces structured findings |
| 3 | (results) | Ensemble report showing consensus findings |

### Playbook 3: Custom Models

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `hackathon with opus and gemini - refactor this function` | Only 2 models dispatched (head-to-head mode) |
| 2 | (results) | Head-to-head comparison, no bracket/tournament |

### Playbook 4: Model Failure

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run a hackathon on a complex task | If a model fails, it retries once |
| 2 | (second failure) | Model DQ'd with flair ("💀 ELIMINATED") |
| 3 | (remaining models) | Hackathon continues with surviving contestants |

### Playbook 5: ELO Persistence

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run hackathon #1 | ELO table created, initial ratings set |
| 2 | Run hackathon #2 | ELO from previous run shown, ratings updated |
| 3 | `show leaderboard` | Current ELO rankings displayed |
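When eyeballing whether "ratings updated" looks right in step 2, the standard Elo formula is a useful yardstick. This is a hedged sketch only: the skill's actual K-factor and rating scale are not documented here, so `k=32` and the 400-point divisor are assumptions.

```python
def elo_update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update for a decisive result (assumed k=32, 400-point scale)."""
    expected_win = 1 / (1 + 10 ** ((loser - winner) / 400))
    delta = k * (1 - expected_win)
    return winner + delta, loser - delta

# Two fresh contestants at 1200: the winner gains exactly half of k.
new_w, new_l = elo_update(1200, 1200)  # → (1216.0, 1184.0)
```

A useful invariant when checking the ELO table after any run: the winner's gain equals the loser's loss, so total rating across contestants is conserved.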

### Playbook 6: Smart Merge

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Complete a build hackathon | Merge options presented |
| 2 | Select "Smart merge" | Best components cherry-picked from each submission |
| 3 | (verify) | Build passes, tests pass after merge |

### Playbook 7: Rematch Mode

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run a hackathon that finishes close (≤2 pts margin) | Offered "🔥 Want a rematch with a tiebreaker?" |
| 2 | Accept and pick a 6th criterion | Re-judging on new criterion only |
| 3 | (results) | Combined scores reveal final winner |

### Playbook 8: Replay Export

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Complete any hackathon | Offered "📼 Want the highlight reel?" |
| 2 | Accept | Markdown file saved with full transcript |
| 3 | (verify) | File contains banner, submissions, scores, podium |

### Playbook 9: Post-Match Analytics

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run 2+ hackathons in one session | Performance data accumulated |
| 2 | `show stats` or `show leaderboard` | Model trends, win rates, ASCII bar charts |
| 3 | (verify) | Per-model breakdown by task type shown |

### Playbook 10: Persistent ELO

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run a hackathon | ELO saved to `~/.copilot/hackathon-elo.json` |
| 2 | Start a new Copilot CLI session | ELO loaded from JSON file into SQL |
| 3 | Run another hackathon | Previous ELO ratings shown in Phase 0 |
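To inspect the persisted file between sessions, a round-trip like the following works, assuming (and this is an assumption — the real schema may differ) a flat `{model_name: rating}` JSON layout:

```python
import json
import os

# Path from Playbook 10; the flat {model: rating} schema is assumed.
ELO_PATH = os.path.expanduser("~/.copilot/hackathon-elo.json")

def load_elo(path: str = ELO_PATH) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}  # first run: no saved ratings yet

def save_elo(ratings: dict, path: str = ELO_PATH) -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(ratings, f, indent=2)
```

Running `load_elo()` after step 1 and comparing against the Phase 0 banner in step 3 is a quick way to confirm cross-session persistence.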

### Playbook 11: Model Tier Selection

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `run hackathon - write a haiku` | Prompted: "⚡ Model tier? Standard or Premium" |
| 2 | Select "Standard" | Standard contestants and judges used, ⚡ badges shown |
| 3 | `run hackathon with premium models - write a haiku` | No tier prompt, premium models used directly, 👑 badges shown |
| 4 | `hackathon with opus and gemini - write a haiku` | No tier prompt, named models used directly |

### Playbook 12: Stall Detection

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run a hackathon on a complex task | Models dispatched, progress monitored |
| 2 | (model stalls for 180s) | Prompted: "⏳ {Model} has been silent for 3 minutes. Want to keep waiting or DQ?" |
| 3 | Select "Keep waiting (60s more)" | Timer extended by 60 seconds |
| 4 | (model stalls again) | Auto-DQ with commentary: "💀 {Model} went AFK. No mercy in this arena." |
| 5 | (alternative: select "DQ and continue") | Model DQ'd immediately, hackathon continues with remaining contestants |
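The stall-detection state machine above can be sketched as a pure decision function, which makes the thresholds easy to test in isolation. The 180s limit and 60s extension come from the playbook; the function itself and its action names are hypothetical stand-ins, not the skill's real implementation.

```python
STALL_LIMIT = 180   # seconds of silence before prompting (Playbook 12, step 2)
EXTENSION = 60      # one user-granted extension (step 3)

def watch(last_output_at: float, extended: bool, now: float) -> str:
    """Decide the action for one monitoring tick.

    Returns "keep-running", "prompt-user" (first stall), or
    "auto-dq" (stalled again after the extension was granted).
    """
    limit = STALL_LIMIT + (EXTENSION if extended else 0)
    if now - last_output_at < limit:
        return "keep-running"
    return "auto-dq" if extended else "prompt-user"
```

Keeping the decision pure (time passed in, action returned) is what lets steps 2–4 be verified without actually waiting three minutes.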

### Playbook 13: Tournament Mode (Default)

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `run hackathon - refactor this module` | Tournament mode activates by default, all available models enter |
| 2 | (Round 1) | Models grouped into heats (elastic brackets based on count). Each heat runs in parallel. |
| 3 | (heat judging) | Per-heat judge panels (3 judges each) score independently in parallel |
| 4 | (heat results) | Mini-ceremony per heat winner: "🏅 {Model} takes Heat {N}!" |
| 5 | (Round 2) | Finalists dispatched with Evolution Brief prepended |
| 6 | (finals judging + results) | Full ceremony, tournament bracket recap in closing |
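When checking step 2, it helps to know roughly how many heats to expect for a given model count. The sketch below is one plausible "elastic bracket" sizing, assuming heats of at most 4 models; the skill's actual heuristic is not specified in this guide.

```python
import math

def make_heats(models: list, max_heat: int = 4) -> list[list]:
    """Split contestants into balanced heats of at most max_heat (assumed cap)."""
    n_heats = math.ceil(len(models) / max_heat)
    # Stripe round-robin so heat sizes differ by at most one.
    return [models[i::n_heats] for i in range(n_heats)]
```

For example, 7 contestants would yield two heats of 4 and 3 models, and each heat's winner advances to Round 2.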

### Playbook 14: Evolution Brief

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Complete Round 1 of a tournament hackathon | Evolution Brief generated from judge scores |
| 2 | (verify) | Brief includes: winning strategies per heat, top scoring categories, key differentiators |
| 3 | (Round 2) | Finalists receive the brief — commentary: "🧬 Finalists have studied the playbook." |

### Playbook 15: Ensemble Synthesis

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Complete a tournament hackathon (build mode) | Merge options presented with "Ensemble synthesis" as recommended |
| 2 | Select "Ensemble synthesis" | Integrator agent analyzes ALL finalist submissions |
| 3 | (verify) | Output shows CONSENSUS/MAJORITY/UNIQUE classifications with provenance annotations |
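The CONSENSUS/MAJORITY/UNIQUE labels in step 3 presumably reflect how many finalist submissions share each element. A minimal sketch of that classification, assuming unanimous → CONSENSUS, exactly one submission → UNIQUE, and anything in between → MAJORITY (the actual thresholds used by the integrator agent are not documented here):

```python
def classify(finding_sets: list[set]) -> dict[str, str]:
    """Label each finding by how many submissions contain it (assumed thresholds)."""
    n = len(finding_sets)
    labels = {}
    for finding in set().union(*finding_sets):
        count = sum(finding in s for s in finding_sets)
        if count == n:
            labels[finding] = "CONSENSUS"
        elif count == 1:
            labels[finding] = "UNIQUE"
        else:
            labels[finding] = "MAJORITY"
    return labels
```

When verifying step 3, each label should be accompanied by provenance (which submissions contributed it), which is the `count` above plus the set membership that produced it.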

### Playbook 16: Classic Mode Fallback

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `run hackathon quick - write a haiku` | Classic mode: 3 contestants, no heats, single round |
| 2 | (verify) | Identical behavior to v1.x — no tournament, no evolution brief |
| 3 | `run hackathon fast - name this variable` | Same classic mode behavior |

### Playbook 17: Grid-Medic Escalation

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run `grid-medic diagnose` on an agent scoring ≤5/10 | Grid-medic detects critical underperformance |
| 2 | (escalation prompt) | "🚨 {agent} is critically underperforming. Escalate to Havoc Hackathon?" |
| 3 | Select "Escalate to Hackathon" | Havoc Hackathon triggered to competitively rewrite the agent |
| 4 | (after hackathon completes) | Grid-medic re-diagnoses the upgraded agent, logs score delta |

## ✅ QA Checklist

Before submitting a PR, verify:

- 🏁 Opening ceremony displays correctly (banner, contestants, rubric)
- ⚡ Models dispatch in parallel (not sequentially)
- 🔒 Submissions are anonymized before judging
- ⚖️ 3 judges score independently, median taken
- 🏆 Winner reveal has dramatic ceremony (drumroll, podium)
- 📈 ELO ratings update correctly after each hackathon
- 🔄 Failed models retry once, then DQ
- 🧬 Smart merge produces working code
- 🎭 MC personality is consistent throughout
- 🚦 Quality gates catch broken builds/tests
- 🔥 Rematch offered when margin ≤ 2 points
- 📼 Replay export saves valid markdown file
- 📊 Post-match analytics display after 2+ hackathons
- 💾 ELO persists to `~/.copilot/hackathon-elo.json`
- ⚡ Tier selection prompt appears when no tier specified
- 👑 Premium models used when explicitly requested
- 🏷️ Tier badges (⚡/👑) shown in opening ceremony
- ⏳ Stall detection prompts user after 180s of no output
- 💀 Auto-DQ on second stall after user-granted extension
- 🏟️ Tournament mode activates by default (not just 3 models)
- 🔥 Elastic brackets size heats correctly for model count
- 🏅 Per-heat judge panels run in parallel (3 judges × N heats)
- 🧬 Evolution Brief generated between rounds from judge scores
- 🏁 Round 2 finalists receive Evolution Brief in their prompt
- 🗳️ Ensemble synthesis shows CONSENSUS/MAJORITY/UNIQUE classifications
- ⚡ Classic mode ("quick"/"fast") produces same 3-model experience
- 🚑 Grid-medic escalation triggers hackathon for weak agents
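The "median taken" item in the checklist is easy to spot-check by hand: with 3 independent judges, the final score should match the middle value, which also means one outlier judge cannot swing the result. A tiny illustration (the score values here are made up):

```python
import statistics

# Three judge scores for one anonymized submission (illustrative values).
judge_scores = [7.5, 9.0, 8.0]
final_score = statistics.median(judge_scores)  # → 8.0

# A single harsh outlier does not drag the result down:
outlier_case = statistics.median([2.0, 9.0, 8.5])  # → 8.5
```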

## 🔍 YAML/Markdown Validation

Verify catalog metadata:

```bash
# Check YAML syntax
python3 -c "import yaml; yaml.safe_load(open('skills/havoc-hackathon/catalog.yml'))" && echo "✅ YAML valid"

# Check required fields
python3 -c "
import yaml
d = yaml.safe_load(open('skills/havoc-hackathon/catalog.yml'))
required = ['schema_version','id','name','description','emoji','codename','category']
missing = [f for f in required if f not in d]
print('✅ All required fields present' if not missing else f'❌ Missing: {missing}')
"
```

## 📊 Coverage Matrix

| Feature | Playbook | Status |
|---------|----------|--------|
| Basic code hackathon | 1 | 🧪 |
| Review mode | 2 | 🧪 |
| Custom model selection | 3 | 🧪 |
| Model failure & DQ | 4 | 🧪 |
| ELO persistence | 5 | 🧪 |
| Smart merge | 6 | 🧪 |
| Rematch mode | 7 | 🧪 |
| Replay export | 8 | 🧪 |
| Post-match analytics | 9 | 🧪 |
| Persistent ELO (cross-session) | 10 | 🧪 |
| Model tier selection | 11 | 🧪 |
| Stall detection | 12 | 🧪 |
| Tournament bracket | - | 🧪 |
| Adaptive rubrics | - | 🧪 |
| Tournament mode (default) | 13 | 🧪 |
| Evolution brief | 14 | 🧪 |
| Ensemble synthesis | 15 | 🧪 |
| Classic mode fallback | 16 | 🧪 |
| Grid-medic escalation | 17 | 🧪 |