This project should choose models the same way it chooses architecture: by task shape, not by habit.
Use the strongest reasoning model only when ambiguity, synthesis, or architectural risk justifies it.
Use coding-optimized or smaller models only when the task is bounded enough that speed matters more than broad reasoning.
- Model: `gpt-5.4`
- Reasoning: `high` or `xhigh`
- Use when:
  - scope is unclear
  - trade-offs are non-obvious
  - decisions affect the whole system
  - we need a skeptical review after implementation
- Why:
  - OpenAI's code generation guidance says to start with `gpt-5.4` for most coding tasks and broader workflows, especially when the work includes reasoning about requirements and mixed tasks.
- Source:
- Model: `gpt-5.4`
- Reasoning: `medium` or `high`
- Use when:
  - implementing a feature across multiple files
  - writing tests plus code
  - refactoring storage, APIs, or behavior
- Why:
  - This project mixes design, Python code, and MCP integration. The general-purpose model is the safest default.
- Model: `gpt-5.3-codex`
- Reasoning: `medium` or `high`
- Use when:
  - one file or one module has clear ownership
  - the contract is already decided
  - the worker is not responsible for product direction
- Why:
  - The project workflow notes favor delegation only when the task is partitionable and the synchronization cost is low.
- Model: `gpt-5.4-mini` or equivalent small coding worker
- Reasoning: `low` or `medium`
- Use when:
  - formatting or rote edits
  - narrow file inspection
  - quick verification that does not drive architecture

Reasoning-effort defaults:

- `low`: only for trivial edits or lookups
- `medium`: default for implementation once the contract is clear
- `high`: default for important code changes or non-trivial debugging
- `xhigh`: reserve for product shaping, architecture, and final validation
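These defaults can be sketched as a small lookup. The task categories and the `pick_effort` helper below are illustrative only, not part of any real API:

```python
# Sketch of the reasoning-effort defaults above.
# Task-category names are hypothetical; adapt them to the project's own vocabulary.
EFFORT_DEFAULTS = {
    "trivial_edit": "low",
    "lookup": "low",
    "implementation": "medium",       # once the contract is clear
    "important_change": "high",
    "debugging": "high",              # non-trivial debugging
    "product_shaping": "xhigh",
    "architecture": "xhigh",
    "final_validation": "xhigh",
}

def pick_effort(task_kind: str) -> str:
    """Return the default reasoning effort for a task kind.

    Unknown kinds fall back to "medium", the safe middle for bounded work.
    """
    return EFFORT_DEFAULTS.get(task_kind, "medium")
```

A fallback of `medium` keeps unclassified work cheap while leaving the expensive tiers opt-in.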
For agent-memory-bridge, use this sequence:
1. `gpt-5.4` with `high` or `xhigh` to define the product slice.
2. `gpt-5.4` or `gpt-5.3-codex` with `medium` or `high` to implement bounded milestones.
3. `gpt-5.4` with `high` to validate the milestone against the PRD.
That matches the project goal:
- avoid speculative features
- prove the smallest useful slice
- validate before expanding
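The sequence above can be sketched as a phase list that every milestone walks through in order. `PHASES`, `plan`, and the phase names are hypothetical illustrations, not a real scheduler:

```python
# Sketch of the define -> implement -> validate loop for agent-memory-bridge.
# Each tuple is (phase, default model, default reasoning effort); the implement
# phase may also use gpt-5.3-codex at medium or high per the guidance above.
PHASES = [
    ("define_slice", "gpt-5.4", "xhigh"),
    ("implement_milestone", "gpt-5.4", "high"),
    ("validate_against_prd", "gpt-5.4", "high"),
]

def plan(milestones: list[str]) -> list[tuple[str, str, str, str]]:
    """Expand each milestone into its full phase sequence, in order.

    Returns (milestone, phase, model, effort) tuples; a milestone's
    validation step always precedes the next milestone's definition,
    which is how "validate before expanding" is enforced.
    """
    steps = []
    for milestone in milestones:
        for phase, model, effort in PHASES:
            steps.append((milestone, phase, model, effort))
    return steps
```

Keeping the loop milestone-major (not phase-major) is what prevents speculative work: nothing for milestone N+1 is scheduled until milestone N has been validated.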
Task-specific picks:

- Benchmark and retrieval evaluation: `gpt-5.4`, `high`
- Codex conversation-ingest design: `gpt-5.4`, `high`
- Bounded code patches after the ingest contract is settled: `gpt-5.3-codex`, `medium`