Read this once – build correctly forever. Any deviation without an approved change‑request (CR) will be rejected at review.
Latest update: 2025‑07‑24 ▪ Schema version: 1.0.0 ▪ Maintainer: VEAB
Large Language Models (LLMs) can already pass ≥ 90 % of unit tests on many coding benchmarks, yet they frequently blow past strict runtime (TLE) and memory (MLE) limits in competitive‑programming and latency‑sensitive production systems. SwiftSolve closes that gap by co‑optimising correctness and Big‑O efficiency through a modular, agentic pipeline.
- A user (human or orchestrator) POSTs a natural‑language problem description to `/solve` or feeds a JSON task to the CLI runner.
- The Planner (Claude 4 Opus) converts prose into a structured algorithmic sketch.
- A Static Pruner shields the system from obviously inefficient plans (e.g. `O(n²)` sort‑in‑loop patterns).
- The Coder (GPT‑4.1) turns the approved plan into ISO C++17 source.
- The Profiler compiles and executes the code inside a deterministic Docker sandbox, capturing wall‑time and RSS across logarithmic input scales.
- The Complexity Analyst analyses the telemetry, fits an empirical complexity, checks constraints, and either declares success or routes a JSON patch to the Coder (local fix) or the Planner (algorithmic overhaul).
- All artefacts are persisted as JSON under `logs/<task_id>/iter_<i>/` for full replay.
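The Analyst's empirical fit can be sketched as an ordinary least-squares slope on log–log telemetry. This is an illustration only; the function name and sample numbers below are invented, not taken from the codebase:

```python
import math

def fit_complexity_exponent(sizes, times):
    """Ordinary least-squares slope of log(time) against log(n).

    A slope near 1 suggests O(n), near 2 suggests O(n^2), and so on.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Telemetry captured at logarithmic input scales, e.g. n = 1e3 .. 1e6:
sizes = [1_000, 10_000, 100_000, 1_000_000]
times = [0.002, 0.021, 0.23, 2.4]                 # wall-time in seconds
slope = fit_complexity_exponent(sizes, times)     # ≈ 1.03 → classify as O(n)
```

Ambiguous slopes (e.g. between O(n) and O(n log n)) are where an LLM fallback earns its keep.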
| Principle | Rationale |
|---|---|
| Modularity | Swap Planner/Coder models independently |
| Determinism | Seed control + cached prompts + Docker sandbox |
| Cost Awareness | Static pruning + early termination to minimise token spend |
| Reproducibility | Every artefact logged with schema version & UTC timestamp |
| Security | Untrusted C++ runs inside cgroup‑limited container |
| Agent | Model / Tool | Function | In → Out |
|---|---|---|---|
| Planner | Claude 4 Opus | Algorithm strategy synthesis | ProblemInput → PlanMessage |
| S‑Pruner | regex + AST | Heuristic plan rejection | PlanMessage → bool |
| Coder | GPT‑4.1 | C++ 17 code generation | PlanMessage → CodeMessage |
| Profiler | g++ + /usr/bin/time | Empirical runtime & memory capture | CodeMessage → ProfileReport |
| Analyst | GPT‑4.1 + heuristic | Complexity fit & routing decision | ProfileReport → VerdictMessage |
1. Validation – Incoming JSON must satisfy the `ProblemInput` schema; otherwise FastAPI returns 422.
2. Plan Generation – The Planner prompt contains the problem text, the required JSON keys, and a 512‑token cap. The response must parse under `PlanMessage.model_validate()`.
3. Static Prune – If `validate(plan) == False`, return `RunResult(status=STATIC_PRUNE_FAILED)`.
4. Iterative Loop (`iter = 0..max_iter−1`):
   1. Call the Coder; the response is parsed into a `CodeMessage`. On failure, retry once; otherwise return `RunResult(status=FAILED)`.
   2. The Profiler compiles with `g++ -O2 -std=c++17`. Resource caps: wall = `runtime_limit` + 250 ms, RSS ≤ 512 MB, stack 256 MB. Parse `/usr/bin/time -v`.
   3. The Analyst receives the `ProfileReport`, fits a log–log slope, classifies complexity and memory class, and returns a `VerdictMessage`.
   4. Check `verdict.efficient`. If true → success. Else route: `target_agent == CODER` → patch prompt, Coder reruns; else → Planner revise prompt, go to 4.1.
   5. Stop if `perf_gain < diminish_delta`, or `iter == max_iter−1`, or ≥ 2 agent crashes.
5. Assemble the `RunResult`, write to `logs_uri`, and return via the API.
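The control flow above can be sketched as follows. Agent calls are stubbed out and every name here is illustrative, not the actual src/swiftsolve implementation:

```python
from dataclasses import dataclass

# Stand-ins for the real message types; field names are illustrative,
# not the definitions in schemas/__init__.py.

@dataclass
class Verdict:
    efficient: bool
    target_agent: str = "CODER"   # or "PLANNER" for an algorithmic overhaul

@dataclass
class RunResult:
    status: str
    iterations: int = 0

def solve_loop(plan_passes_pruner: bool, verdicts: list, max_iter: int = 3) -> RunResult:
    """Skeleton of the Planner -> Pruner -> Coder -> Profiler -> Analyst loop.

    verdicts[i] simulates the Analyst's decision on iteration i; the real
    loop also stops early on diminishing perf_gain or >= 2 agent crashes.
    """
    if not plan_passes_pruner:                      # step 3: static prune gate
        return RunResult(status="STATIC_PRUNE_FAILED")
    for i in range(max_iter):                       # step 4: iterative loop
        verdict = verdicts[i]                       # Coder -> Profiler -> Analyst
        if verdict.efficient:                       # step 4.4: success check
            return RunResult(status="SUCCESS", iterations=i + 1)
        # otherwise route the patch: CODER -> local fix, PLANNER -> replan
    return RunResult(status="FAILED", iterations=max_iter)

# A run that converges on the second iteration:
result = solve_loop(True, [Verdict(False), Verdict(True), Verdict(True)])
```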
All six message types (PlanMessage, CodeMessage, ProfileReport, VerdictMessage, ProblemInput, RunResult) are fully defined in schemas/__init__.py and must not be duplicated elsewhere.
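As an illustration of the envelope style (assuming Pydantic v2; the field names below are invented for this example, and the authoritative definitions live in schemas/__init__.py):

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class PlanMessage(BaseModel):
    model_config = ConfigDict(extra="forbid")   # unknown keys -> ValidationError

    task_id: str
    schema_version: str = "1.0.0"
    algorithm: str                              # e.g. "prefix sums"
    complexity_target: str                      # e.g. "O(n)"

plan = PlanMessage.model_validate({
    "task_id": "cf_1700A",
    "algorithm": "prefix sums",
    "complexity_target": "O(n)",
})

rejected = False
try:
    PlanMessage.model_validate({"task_id": "x", "algorithm": "a",
                                "complexity_target": "O(1)", "oops": 1})
except ValidationError:
    rejected = True                             # extras are forbidden
```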
- Loop depth > 2 & `n ≥ 1e5` → reject.
- Sort‑in‑loop & `n ≥ 1e3` → reject.
- Unbounded recursion & `n ≥ 1e4` → reject.
- Configurable via `static_pruner/pruner.toml` hot‑reload.
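A minimal sketch of how two of these rules might look (illustrative regexes only; the real pruner combines regex with AST analysis and reads its thresholds from pruner.toml):

```python
import re

RULES = [
    # (pattern over the plan's pseudocode, minimum n at which to reject)
    (re.compile(r"(for .*\n\s+){2,}for ", re.M), 100_000),   # loop depth > 2
    (re.compile(r"for .*\n\s+.*\bsort\b", re.M), 1_000),     # sort inside a loop
]

def validate(pseudocode: str, n: int) -> bool:
    """Return False (reject the plan) if any heuristic fires at input bound n."""
    for pattern, threshold in RULES:
        if n >= threshold and pattern.search(pseudocode):
            return False
    return True

bad_plan = "for i in 1..n:\n    sort(a)\n"
validate(bad_plan, 10**6)   # False: sort-in-loop with n >= 1e3
validate(bad_plan, 100)     # True: n too small for the rule to apply
```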
- Namespace: non‑root, seccomp `docker/default`, no network.
- `resource.setrlimit`: `RLIMIT_AS` = 512 MB, `RLIMIT_STACK` = 256 MB, `RLIMIT_FSIZE` = 50 MB (stdout/stderr cap).
- Crash codes propagated to the controller as `RunStatus.SANDBOX_ERROR`.
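The caps can be applied in the sandboxed child via `preexec_fn`, roughly as follows (a sketch, POSIX‑only; the actual sandbox/run_in_sandbox.py may differ, and `/bin/echo` stands in for the compiled binary):

```python
import resource
import subprocess

MB = 1024 ** 2

def set_limits():
    """Runs in the child process just before exec."""
    resource.setrlimit(resource.RLIMIT_AS, (512 * MB, 512 * MB))      # address space
    resource.setrlimit(resource.RLIMIT_STACK, (256 * MB, 256 * MB))   # stack
    resource.setrlimit(resource.RLIMIT_FSIZE, (50 * MB, 50 * MB))     # output file cap

proc = subprocess.run(
    ["/bin/echo", "ok"],       # placeholder for the compiled ./a.out
    capture_output=True, text=True,
    timeout=2.25,              # e.g. runtime_limit (2 s) + 250 ms slack
    preexec_fn=set_limits,
)
```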
| Metric | Definition |
|---|---|
| pass@k | ≥ 1 of top‑k programs passes official unit tests |
| eff@k_runtime | ≥ 1 of top‑k passes time limit |
| eff@k_memory | ≥ 1 of top‑k stays < memory limit |
| TLE/MLE rate | % of executions exceeding runtime or memory cap |
| Iteration count | Mean #loops until efficient==true |
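pass@k (and eff@k_* with "passes" replaced by "meets the limit") can be computed with the standard unbiased estimator over n sampled programs of which c succeed; this assumes sampling‑based evaluation rather than a fixed ranking:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(>= 1 of k samples passes), given n samples, c passing."""
    if n - c < k:
        return 1.0                      # every size-k draw must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

pass_at_k(10, 3, 1)   # ≈ 0.3   (3 of 10 generations pass; draw one)
pass_at_k(10, 3, 5)   # ≈ 0.917
```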
- BigO(Bench) – 50 tasks across O(1)–O(n²).
- Codeforces Div‑2 – 25 tasks (800–1800 rating).
- Stored under `datasets/<source>/task_<id>.json` following `task_format.py`.
No user code is stored beyond telemetry; LLM prompts are cached encrypted at rest (AES‑GCM) if CACHE_ENCRYPT_KEY is set.
- `/solve` – POST – body `ProblemInput`, response `RunResult`.
- `/healthz` – GET – returns 200 + git hash + schema version.
- Python 3.11.13 (orchestrator)
- C++ 17 (generated code)
- Bash + GNU coreutils (sandbox scripts)
| Domain | Package | Version | Notes |
|---|---|---|---|
| API | FastAPI, Uvicorn[standard] | 0.116.1 | ASGI + Hot reload |
| LLM SDKs | openai, anthropic | 1.95.1 · 0.57.1 | Model calls |
| Schema | Pydantic v2.x + pydantic‑settings | 2.9.2 | Validation + env config |
| Testing | pytest, pytest‑asyncio | 8.1.0 · 0.23.5 | Unit & async tests |
| Lint | Ruff, Mypy, Pre‑commit | latest | CI static analysis |
| Logging | Rich, loguru (optional) | 13.7.1 | Colour logs + tracebacks |
| Data | pandas, plotnine, orjson | pinned | Evaluation & plotting |
- Docker 24.x ➜ Ubuntu 20.04 base image.
- Terraform 1.8.x ➜ GKE Autopilot cluster.
- GitHub Actions CI ➜ 3.11/3.12 matrix, push & PR.
| Variable | Purpose |
|---|---|
| `OPENAI_API_KEY` | GPT‑4.1 access |
| `ANTHROPIC_API_KEY` | Claude 4 Opus access |
| `LOG_LEVEL` | Default `INFO`, override to `DEBUG` |
| `CACHE_ENCRYPT_KEY` | 32‑byte key for encrypted prompt cache |
- Exact pins (`==`) for prod deps in `requirements.txt`.
- Dev‑only tools (`ruff`, `mypy`) in `requirements-dev.txt`.
- `pyproject.toml` sets `requires-python == 3.11.13`.
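For example, the prod pin file might look like this (an illustrative fragment using the versions from the dependency table above):

```text
# requirements.txt — exact (==) pins for production
fastapi==0.116.1
pydantic==2.9.2
openai==1.95.1
anthropic==0.57.1
```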
src/swiftsolve/
├── main.py # FastAPI + CLI entry
├── api/routes.py # /solve, /healthz
├── controller/
│ ├── solve_loop.py # Core orchestration
│ └── router.py # FastAPI wrapper (planned)
├── agents/
│ ├── base.py # Agent ABC + retry/caching hooks
│ ├── planner.py # Claude client, prompt templates
│ ├── coder.py # GPT‑4.1 client, code‑json extraction
│ ├── profiler.py # Sandbox wrapper + telemetry parse
│ └── analyst.py # Complexity fit + patch routing
├── static_pruner/pruner.py # Regex + AST heuristics
├── sandbox/
│ ├── run_in_sandbox.py # g++ compile & exec with limits
│ └── docker_utils.py # image build & push (WIP)
├── schemas/__init__.py # Unified Pydantic v2 models
├── datasets/
│ ├── parse_bigobench.py # HTML/JSON → task format
│ ├── parse_codeforces.py # Scrape + IO normalisation
│ └── task_format.py # Shared spec & validation
├── evaluation/
│ ├── metrics.py # pass@k, eff@k_* calculators
│ └── stats.py # DataFrame aggregation & plots
├── utils/
│ ├── config.py # pydantic‑settings Settings()
│ └── logger.py # Colour + rotating logs
└── tests/
├── test_agents.py # Planner/Coder mocks
├── test_pruner.py # Heuristic edge cases
├── test_schema.py # JSON round‑trip & forbid extras
└── test_sandbox.py # Compile & run smoke test
Every file must include docstring headers explaining purpose, inputs, and side‑effects.
The following hierarchy represents all work items required for SwiftSolve v1.0. Completed items are prefixed ✔.
- A1 Environment Setup
- A1.1 Create pyenv 3.11.13 and poetry project.
- A1.2 Add `requirements.txt` & `requirements-dev.txt`.
- A1.3 Configure pre‑commit (ruff, black, mypy, isort).
- A2 Schema Layer
- A2.1 Design envelope & message enums.
- ✔ A2.2 Implement `schemas/__init__.py` with Pydantic v2.
- A2.3 Add `test_schema.py` (field presence, forbid extras).
- A3 Logging & Config
- ✔ A3.1 `utils.logger` colour + rotating handler.
- ✔ A3.2 `utils.config` Settings singleton via pydantic‑settings.
- B1 Planner Agent
- B1.1 Prompt template design (JSON‑only output).
- B1.2 Claude client wrapper with retry & cache (Sqlite + orjson).
- B1.3 Unit tests: valid JSON, fallback plan.
- B2 Static Pruner
- ✔ B2.1 Regex + AST rule implementation.
- B2.2 `pruner.toml` external rule config.
- B2.3 Benchmark false‑positive rate on 100 plans.
- B3 Coder Agent
- B3.1 JSON‑only code prompt; ensure includes & I/O.
- B3.2 Escape‑sequence cleaning; compile smoke test.
- B3.3 Inject optional chrono & memory hooks on `debug` flag.
- B4 Solve Loop v0
- B4.1 Integrate Planner → Pruner → Coder chain.
- B4.2 FastAPI `/solve` returns stub `RunResult`.
- C1 Sandbox Runtime
- C1.1 Dockerfile (Ubuntu 20.04, g++‑10, time, gprof).
- C1.2 `run_in_sandbox.py` compile → run → time → RSS.
- C1.3 `resource.setrlimit` caps; stack 256 MB.
- C2 Profiler Agent
- C2.1 Generate logarithmic input scales.
- C2.2 Parse `/usr/bin/time -v` (regex).
- C2.3 Produce `ProfileReport`.
- C3 Complexity Analyst
- C3.1 Heuristic slope fit (log‑log).
- C3.2 GPT‑4.1 fallback for ambiguous curves.
- C3.3 Patch routing logic (`TargetAgent`).
- C4 Termination Logic
- C4.1 Implement `perf_gain` check vs `diminish_delta`.
- C4.2 Loop abort on 2 agent failures.
- D1 Dataset Parsers
- D1.1 HTML scrape BigO(Bench) tasks.
- D1.2 REST scrape Codeforces tasks.
- D1.3 Validate against `task_format.py`.
- D2 Batch Runner
- D2.1 CLI flags: `--benchmark`, `--seeds`, `--replans`.
- D2.2 Multiprocess pool, progress bar (tqdm).
- D2.3 Store artefacts under `results/<task>/seed_<s>/`.
- D3 Metrics & Plots
- D3.1 Calc pass@k, eff@k_*, TLE/MLE.
- D3.2 Plot runtime curves (plotnine) per class.
- D3.3 Generate Markdown + CSV summary.
- E1 Infrastructure as Code
- E1.1 Terraform modules for GKE Autopilot.
- E1.2 GCS bucket + Cloud NAT egress.
- E2 Scaling Run
- E2.1 Run 12 000 cycles; monitor cost.
- E2.2 Download results + run evaluation.
- E3 Paper & Artefact
- E3.1 Write methodology, experiments, ablations.
- E3.2 Insert tables: eff@1, iteration counts.
- E3.3 Prepare artefact for NeurIPS reproducibility checklist.
Note: for testing, we will use gpt-4.1 and claude-4-opus for budgeting reasons.