PromptReboot
Multi-agent prompt diagnostics for identifying reliability issues in LLM prompts before they are used in production workflows.
This tool analyzes prompts for structural failure modes such as ambiguity, missing constraints, conflicting instructions, and evaluation gaps. It returns diagnostic findings instead of rewriting prompts.
The goal is to surface problems early so prompt design decisions remain explicit and under developer control.
What this tool does
PromptReboot runs multiple diagnostic agents in parallel, each responsible for detecting a specific class of prompt failure.
The system:
- analyzes prompts using targeted diagnostic passes
- detects Medium and High severity prompt design issues
- validates agent outputs against a strict schema
- combines findings deterministically in code
- avoids rewriting or optimizing prompts
- skips alignment analysis when goal and example are absent
This is a diagnostic system, not a prompt optimizer.
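A finding returned by a diagnostic pass might look like the following sketch. The field names (`category`, `severity`, `evidence`, `rationale`) are illustrative assumptions, not the real schema:

```python
from dataclasses import dataclass

# Hypothetical finding record; the actual schema lives in the backend.
@dataclass(frozen=True)
class Finding:
    category: str   # e.g. "Vague success criteria"
    severity: str   # only "High" or "Medium" are allowed
    evidence: str   # the prompt text the finding cites
    rationale: str  # why this issue affects correctness or reliability

    def __post_init__(self):
        # Enforce the severity contract at construction time.
        if self.severity not in ("High", "Medium"):
            raise ValueError(f"disallowed severity: {self.severity}")
```

Making the record immutable and self-validating keeps malformed agent output from entering the combined results.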
Failure modes detected
The current agents detect issues in categories including:
- Goal–Prompt–Example misalignment
- Hard instruction contradictions
- Soft instruction contradictions
- Role overload
- Audience–voice mismatch
- Vague success criteria
- Missing priority ordering
- Implicit domain assumptions
- Over-constraint (brittleness)
- Under-constraint (hallucination risk)
- Ambiguous constraint scope
- No self-check or validation step
- Example output misuse
- Single example overfitting
- Unscaffolded complex reasoning
- Hidden intermediate requirements
Agents only report issues when they are likely to materially affect correctness, consistency, or reliability.
If no issues are detected, the system returns an empty findings list.
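The "report only material issues, otherwise stay silent" contract can be sketched as a runtime validation step. The function and field names below are assumptions for illustration, not the project's actual API:

```python
ALLOWED_SEVERITIES = {"High", "Medium"}
REQUIRED_FIELDS = {"category", "severity", "evidence"}

def validate_findings(raw: list) -> list:
    """Reject malformed agent output instead of silently accepting it."""
    for finding in raw:
        missing = REQUIRED_FIELDS - finding.keys()
        if missing:
            raise ValueError(f"missing fields: {missing}")
        if finding["severity"] not in ALLOWED_SEVERITIES:
            raise ValueError(f"disallowed severity: {finding['severity']}")
    return raw  # an empty list is a valid, successful result
```

Note that an empty input passes through unchanged: silence is a legitimate outcome, not an error.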
Architecture overview
Execution is parallel, deterministic, and schema-validated.
Flow:
Prompt → Input serialization → Parallel agent execution → Validation → Combined findings
Key properties:
- agents run concurrently using asyncio
- each agent uses a strict diagnostic system prompt
- no aggregator LLM is used
- findings are validated at runtime
- only "High" and "Medium" severities are allowed
- per-agent ordering is enforced (High before Medium)
- alignment agent is skipped deterministically when not applicable
The execution model lives in backend/graphy.py.
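The flow above can be sketched with asyncio. The agent functions here are hypothetical stand-ins with toy detection logic, not the real agents in graphy.py:

```python
import asyncio

# Illustrative stand-in agents; each returns (severity, category) tuples.
async def contradiction_agent(prompt):
    return [("Medium", "Soft instruction contradiction")] if " but " in prompt else []

async def criteria_agent(prompt):
    return [("High", "Vague success criteria")] if "good" in prompt else []

async def alignment_agent(prompt, goal, example):
    # In the real system this agent checks goal/prompt/example alignment.
    return []

SEVERITY_RANK = {"High": 0, "Medium": 1}

async def run_diagnostics(prompt, goal=None, example=None):
    tasks = [contradiction_agent(prompt), criteria_agent(prompt)]
    # Deterministic skip: alignment analysis runs only when both inputs exist.
    if goal is not None and example is not None:
        tasks.append(alignment_agent(prompt, goal, example))
    per_agent = await asyncio.gather(*tasks)  # agents run concurrently
    combined = []
    for findings in per_agent:
        # Only "High" and "Medium" severities are allowed.
        assert all(sev in SEVERITY_RANK for sev, _ in findings)
        # Per-agent ordering: High before Medium.
        combined.extend(sorted(findings, key=lambda f: SEVERITY_RANK[f[0]]))
    return combined  # combined deterministically in code, no aggregator LLM

results = asyncio.run(run_diagnostics("Write a good summary, but keep it vague."))
```

The key property this sketch preserves is that the merge step is plain code: agent outputs are validated, ordered, and concatenated without another model in the loop.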
Example
Prompt:
"Summarize this email thread and decide whether the customer should get a refund."
Typical findings:
- Vague success criteria
- Under-constraint (hallucination risk)
- No self-check or validation step
The tool explains why each issue affects output reliability and cites the relevant prompt text.
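For the email-thread prompt above, the combined output might look like this hypothetical payload. The categories come from the list earlier in this document; the severities, evidence strings, and rationales are purely illustrative:

```python
# Illustrative findings payload; not actual tool output.
example_findings = [
    {
        "category": "Vague success criteria",
        "severity": "High",
        "evidence": "decide whether the customer should get a refund",
        "rationale": "No refund policy or decision criteria are given.",
    },
    {
        "category": "Under-constraint (hallucination risk)",
        "severity": "Medium",
        "evidence": "Summarize this email thread",
        "rationale": "No length, format, or fidelity constraints on the summary.",
    },
    {
        "category": "No self-check or validation step",
        "severity": "Medium",
        "evidence": "(entire prompt)",
        "rationale": "The prompt never asks the model to verify its decision.",
    },
]
```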
Design philosophy
Diagnostics over rewriting
Rewriting a prompt requires making decisions the original prompt did not specify. This tool surfaces problems instead of silently resolving them.
Separation of analysis and design
Prompt analysis identifies failure modes. Prompt design remains the user's responsibility.
Precision-first reporting
Agents report only Medium and High severity issues that are supported by concrete evidence from the prompt.
Silence is success
If no issues are detected, the correct output is an empty findings list.
Running locally
pip install -r requirements.txt
cd backend
uvicorn api.asgi:app
Repository structure
backend/graphy.py Parallel diagnostic execution engine.
backend/llm/agents/ Agent configurations and category definitions.
backend/llm/prompts/ Diagnostic system prompts used by each agent.
backend/infra/ Logging and shared utilities.
backend/api/ ASGI application for running diagnostics.
Notes
This project is intentionally small and focused. It is designed to be used as a prompt analysis step before prompts are deployed into LLM workflows.
The system prioritizes deterministic execution, schema validation, and strict agent scope boundaries over flexibility or prompt optimization.