This document describes how to verify that the Havoc Hackathon skill works correctly.
Since this is a conversational AI skill rather than traditional code, testing is done through conversation playbooks: scripted interactions that verify expected behavior.
## 🎮 How to Test Locally

1. Register the skill in a Copilot CLI session:

   ```
   /skills add ./
   ```

2. Run each playbook below and verify the expected behavior.
3. Check the QA checklist at the bottom before submitting a PR.
## 📋 Conversation Playbooks

### Playbook 1: Basic Code Hackathon

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `run hackathon - write a fizzbuzz function` | Opening ceremony with arena banner, contestants, rubric |
| 2 | *(accept defaults or customize)* | 3 models dispatched in parallel with progress commentary |
| 3 | *(wait for completion)* | All models finish, outputs normalized |
| 4 | *(judging phase)* | Sealed panel scores anonymized submissions |
| 5 | *(results)* | Drumroll → winner reveal → ASCII podium → ELO update |
### Playbook 2: Review Mode

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `run hackathon - review @src/app.js for security issues` | Detects review mode, adjusts rubric |
| 2 | *(models complete)* | Each model produces structured findings |
| 3 | *(results)* | Ensemble report showing consensus findings |
### Playbook 3: Custom Models

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `hackathon with opus and gemini - refactor this function` | Only 2 models dispatched (head-to-head mode) |
| 2 | *(results)* | Head-to-head comparison, no bracket/tournament |
### Playbook 4: Model Failure

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `run hackathon` on a complex task | If a model fails, it retries once |
| 2 | *(second failure)* | Model DQ'd with flair ("💀 ELIMINATED") |
| 3 | *(remaining models)* | Hackathon continues with surviving contestants |
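The retry policy above boils down to a small loop: one retry, then elimination. A minimal sketch of that logic (all names here are illustrative; the skill implements this conversationally, not as Python):

```python
# Sketch of the retry-then-DQ policy from Playbook 4.
# `dispatch` is a hypothetical stand-in for the skill's dispatch step.
def run_contestant(model, task, dispatch):
    """Give each model one retry; DQ on the second failure."""
    for attempt in (1, 2):
        try:
            return dispatch(model, task)
        except Exception:
            if attempt == 1:
                print(f"⚠️ {model} crashed - retrying once...")
    print(f"💀 ELIMINATED: {model} is out of the arena.")
    return None  # the hackathon continues with surviving contestants
```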
### Playbook 5: ELO Persistence

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run hackathon #1 | ELO table created, initial ratings set |
| 2 | Run hackathon #2 | ELO from previous run shown, ratings updated |
| 3 | `show leaderboard` | Current ELO rankings displayed |
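For reviewers unfamiliar with ELO: the standard update is `R' = R + K(S - E)` where `E = 1 / (1 + 10^((R_opp - R) / 400))`. A reference sketch of that math (the K-factor and pairing scheme the skill actually uses are not specified in this document):

```python
# Standard Elo update, shown only to illustrate the rating math this
# playbook exercises; parameters here are conventional defaults.
def elo_update(winner: float, loser: float, k: float = 32) -> tuple[float, float]:
    """Return updated (winner, loser) ratings after one head-to-head result."""
    expected_win = 1 / (1 + 10 ** ((loser - winner) / 400))
    delta = k * (1 - expected_win)
    return winner + delta, loser - delta

# An underdog win moves ratings more: a 1500-rated model beating a
# 1600-rated one gains about 20.5 points.
print(elo_update(1500, 1600))  # ≈ (1520.5, 1579.5)
```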
### Playbook 6: Smart Merge

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Complete a build hackathon | Merge options presented |
| 2 | Select "Smart merge" | Best components cherry-picked from each submission |
| 3 | *(verify)* | Build passes, tests pass after merge |
### Playbook 7: Rematch Mode

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run a hackathon that finishes close (≤2 pts margin) | Offered "🔥 Want a rematch with a tiebreaker?" |
| 2 | Accept and pick a 6th criterion | Re-judging on new criterion only |
| 3 | *(results)* | Combined scores reveal final winner |
### Playbook 8: Replay Export

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Complete any hackathon | Offered "📼 Want the highlight reel?" |
| 2 | Accept | Markdown file saved with full transcript |
| 3 | *(verify)* | File contains banner, submissions, scores, podium |
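A hypothetical writer for that file, only to show the sections step 3 checks for (banner, submissions, scores, podium); the actual layout the skill emits may differ:

```python
# Assumed export shape for Playbook 8 - section names come from the
# checklist above, everything else is a placeholder.
from pathlib import Path

def export_replay(banner, submissions, scores, podium,
                  path="hackathon-replay.md"):
    lines = ["# 📼 Hackathon Highlight Reel", "", banner, "", "## Submissions"]
    for model, output in submissions.items():
        lines += [f"### {model}", "", output, ""]
    lines += ["## Scores"]
    lines += [f"- {model}: {score}" for model, score in scores.items()]
    lines += ["", "## Podium", "", podium]
    Path(path).write_text("\n".join(lines), encoding="utf-8")
```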
### Playbook 9: Post-Match Analytics

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run 2+ hackathons in one session | Performance data accumulated |
| 2 | `show stats` or `show leaderboard` | Model trends, win rates, ASCII bar charts |
| 3 | *(verify)* | Per-model breakdown by task type shown |
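The exact chart format is up to the skill; a minimal sketch of the kind of ASCII bar chart step 2 expects for win rates:

```python
# Illustrative only - the skill's real chart layout is not specified here.
def win_rate_chart(win_rates: dict[str, float], width: int = 20) -> str:
    rows = []
    for model, rate in sorted(win_rates.items(), key=lambda kv: -kv[1]):
        bar = "█" * round(rate * width)
        rows.append(f"{model:<10} {bar:<{width}} {rate:.0%}")
    return "\n".join(rows)

print(win_rate_chart({"opus": 0.6, "gemini": 0.25, "gpt": 0.15}))
```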
### Playbook 10: Persistent ELO

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run a hackathon | ELO saved to `~/.copilot/hackathon-elo.json` |
| 2 | Start a new Copilot CLI session | ELO loaded from JSON file into SQL |
| 3 | Run another hackathon | Previous ELO ratings shown in Phase 0 |
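The schema of `~/.copilot/hackathon-elo.json` is not documented here; a sketch of the load step under an assumed `{"ratings": {model: rating}}` shape:

```python
# Hypothetical load step for Playbook 10's persistence check.
# The {"ratings": {...}} layout is an assumption, not the documented schema.
import json
from pathlib import Path

ELO_FILE = Path.home() / ".copilot" / "hackathon-elo.json"

def load_ratings() -> dict[str, float]:
    if not ELO_FILE.exists():
        return {}  # first run: Playbook 5 step 1 creates fresh ratings
    data = json.loads(ELO_FILE.read_text())
    return {model: float(r) for model, r in data.get("ratings", {}).items()}
```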
### Playbook 11: Model Tier Selection

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `run hackathon - write a haiku` | Prompted: "⚡ Model tier? Standard or Premium" |
| 2 | Select "Standard" | Standard contestants and judges used, ⚡ badges shown |
| 3 | `run hackathon with premium models - write a haiku` | No tier prompt, premium models used directly, 👑 badges shown |
| 4 | `hackathon with opus and gemini - write a haiku` | No tier prompt, named models used directly |
### Playbook 12: Stall Detection

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run a hackathon on a complex task | Models dispatched, progress monitored |
| 2 | *(model stalls for 180s)* | Prompted: "⏳ {Model} has been silent for 3 minutes. Want to keep waiting or DQ?" |
| 3 | Select "Keep waiting (60s more)" | Timer extended by 60 seconds |
| 4 | *(model stalls again)* | Auto-DQ with commentary: "💀 {Model} went AFK. No mercy in this arena." |
| 5 | *(alternative: select "DQ and continue")* | Model DQ'd immediately, hackathon continues with remaining contestants |
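The timing rules in this playbook (180s of silence triggers the prompt, one 60s extension allowed, then auto-DQ) can be summarized in a small decision function. A sketch, with hypothetical names; the skill tracks this conversationally:

```python
# Sketch of Playbook 12's stall policy. `silent_for` would come from a
# hypothetical "seconds since last output" hook in the dispatch loop.
STALL_AFTER_S = 180   # step 2: prompt after 3 minutes of silence
EXTENSION_S = 60      # step 3: "Keep waiting" buys one extension

def stall_action(silent_for: float, already_extended: bool) -> str:
    """Return what the arena should do with a silent model."""
    if silent_for < STALL_AFTER_S:
        return "wait"
    if not already_extended:
        return "prompt"  # user picks: extend by EXTENSION_S, or DQ now
    return "auto-dq"     # step 4: second stall - no mercy in this arena
```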
### Playbook 13: Tournament Mode (Default)

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `run hackathon - refactor this module` | Tournament mode activates by default, all available models enter |
| 2 | *(Round 1)* | Models grouped into heats (elastic brackets based on count). Each heat runs in parallel. |
| 3 | *(heat judging)* | Per-heat judge panels (3 judges each) score independently in parallel |
| 4 | *(heat results)* | Mini-ceremony per heat winner: "🏅 {Model} takes Heat {N}!" |
| 5 | *(Round 2)* | Finalists dispatched with Evolution Brief prepended |
| 6 | *(finals judging + results)* | Full ceremony, tournament bracket recap in closing |
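"Elastic brackets" means the heat count scales with how many models enter. One plausible grouping rule, sketched under assumptions (the skill's exact rule and heat size are not specified in this document):

```python
# Assumed heat-grouping rule for Playbook 13's "elastic brackets":
# cap heats at ~3 models and round-robin to keep sizes balanced.
import math

def make_heats(models: list[str], max_heat_size: int = 3) -> list[list[str]]:
    n_heats = math.ceil(len(models) / max_heat_size)
    heats: list[list[str]] = [[] for _ in range(n_heats)]
    for i, model in enumerate(models):
        heats[i % n_heats].append(model)
    return heats

print(make_heats(["opus", "gemini", "gpt", "sonnet", "haiku", "grok", "llama"]))
# [['opus', 'sonnet', 'llama'], ['gemini', 'haiku'], ['gpt', 'grok']]
```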
### Playbook 14: Evolution Brief

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Complete Round 1 of a tournament hackathon | Evolution Brief generated from judge scores |
| 2 | *(verify)* | Brief includes: winning strategies per heat, top scoring categories, key differentiators |
| 3 | *(Round 2)* | Finalists receive the brief, with commentary: "🧬 Finalists have studied the playbook." |
### Playbook 15: Ensemble Synthesis

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Complete a tournament hackathon (build mode) | Merge options presented with "Ensemble synthesis" as recommended |
| 2 | Select "Ensemble synthesis" | Integrator agent analyzes ALL finalist submissions |
| 3 | *(verify)* | Output shows CONSENSUS/MAJORITY/UNIQUE classifications with provenance annotations |
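One way to read the CONSENSUS/MAJORITY/UNIQUE labels: each finding is classified by how many finalist submissions independently contain it. A sketch under the simplifying assumption that findings are comparable strings (the integrator agent's real matching is semantic, not string equality):

```python
# Assumed classification rule behind Playbook 15's labels.
def classify_findings(submissions: dict[str, set[str]]) -> dict[str, str]:
    total = len(submissions)
    labels: dict[str, str] = {}
    for finding in set().union(*submissions.values()):
        count = sum(finding in s for s in submissions.values())
        if count == total:
            labels[finding] = "CONSENSUS"   # every finalist found it
        elif count > total / 2:
            labels[finding] = "MAJORITY"
        elif count == 1:
            labels[finding] = "UNIQUE"
        else:
            labels[finding] = f"{count}/{total}"  # provenance annotation
    return labels
```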
### Playbook 16: Classic Mode Fallback

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | `run hackathon quick - write a haiku` | Classic mode: 3 contestants, no heats, single round |
| 2 | *(verify)* | Identical behavior to v1.x: no tournament, no evolution brief |
| 3 | `run hackathon fast - name this variable` | Same classic mode behavior |
### Playbook 17: Grid-Medic Escalation

| Step | You Say | Expected Behavior |
|------|---------|-------------------|
| 1 | Run `grid-medic diagnose` on an agent scoring ≤5/10 | Grid-medic detects critical underperformance |
| 2 | *(escalation prompt)* | "🚨 {agent} is critically underperforming. Escalate to Havoc Hackathon?" |
| 3 | Select "Escalate to Hackathon" | Havoc Hackathon triggered to competitively rewrite the agent |
| 4 | *(after hackathon completes)* | Grid-medic re-diagnoses the upgraded agent, logs score delta |
## ✅ QA Checklist

- [ ] ⚡ Classic mode (`quick`/`fast`) produces the same 3-model experience
- [ ] 🚑 Grid-medic escalation triggers a hackathon for weak agents
## 🔍 YAML/Markdown Validation

Verify catalog metadata:

```bash
# Check YAML syntax
python3 -c "import yaml; yaml.safe_load(open('skills/havoc-hackathon/catalog.yml'))" && echo "✅ YAML valid"

# Check required fields
python3 -c "
import yaml
d = yaml.safe_load(open('skills/havoc-hackathon/catalog.yml'))
required = ['schema_version', 'id', 'name', 'description', 'emoji', 'codename', 'category']
missing = [f for f in required if f not in d]
print('✅ All required fields present' if not missing else f'❌ Missing: {missing}')
"
```