Read this once – build correctly forever. Any deviation without an approved change‑request (CR) will be rejected at review.
Latest update: 2025‑07‑24 ▪ Schema version: 1.0.0 ▪ Maintainer: VEAB
Large Language Models (LLMs) can already pass ≥ 90 % of unit tests on many coding benchmarks, yet they frequently blow past strict runtime (TLE) and memory (MLE) limits in competitive‑programming and latency‑sensitive production systems. SwiftSolve closes that gap by co‑optimising correctness and Big‑O efficiency through a modular, agentic pipeline.
- A user (human or orchestrator) POSTs a natural‑language problem description to `/solve` or feeds a JSON task to the CLI runner.
- The Planner (Claude 4 Opus) converts prose into a structured algorithmic sketch.
- A Static Pruner shields the system from obviously inefficient plans (e.g. `O(n²)` sort‑in‑loop patterns).
- The Coder (GPT‑4.1) turns the approved plan into ISO C++17 source.
- The Profiler compiles and executes the code inside a deterministic Docker sandbox, capturing wall‑time and RSS across logarithmic input scales.
- The Complexity Analyst analyses the telemetry, fits an empirical complexity, checks constraints, and either declares success or routes a JSON patch to the Coder (local fix) or the Planner (algorithmic overhaul).
- All artefacts are persisted as JSON under `logs/<task_id>/iter_<i>/` for full replay.
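The Analyst's empirical fit can be sketched as an ordinary least-squares slope on log–log telemetry. This is an illustration only; the function name and sample numbers below are invented, not taken from the codebase:

```python
import math

def fit_complexity_exponent(sizes, times):
    """Ordinary least-squares slope of log(time) against log(n).

    A slope near 1 suggests O(n), near 2 suggests O(n^2), and so on.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Telemetry captured at logarithmic input scales, e.g. n = 1e3 .. 1e6:
sizes = [1_000, 10_000, 100_000, 1_000_000]
times = [0.002, 0.021, 0.23, 2.4]                 # wall-time in seconds
slope = fit_complexity_exponent(sizes, times)     # ≈ 1.03 → classify as O(n)
```

Ambiguous slopes (e.g. between O(n) and O(n log n)) are where an LLM fallback earns its keep.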
| Principle | Rationale |
|---|---|
| Modularity | Swap Planner/Coder models independently |
| Determinism | Seed control + cached prompts + Docker sandbox |
| Cost Awareness | Static pruning + early termination to minimise token spend |
| Reproducibility | Every artefact logged with schema version & UTC timestamp |
| Security | Untrusted C++ runs inside cgroup‑limited container |
| Agent | Model / Tool | Function | In → Out |
|---|---|---|---|
| Planner | Claude 4 Opus | Algorithm strategy synthesis | ProblemInput → PlanMessage |
| S‑Pruner | regex + AST | Heuristic plan rejection | PlanMessage → bool |
| Coder | GPT‑4.1 | C++ 17 code generation | PlanMessage → CodeMessage |
| Profiler | g++ + /usr/bin/time | Empirical runtime & memory capture | CodeMessage → ProfileReport |
| Analyst | GPT‑4.1 + heuristic | Complexity fit & routing decision | ProfileReport → VerdictMessage |
1. Validation – Incoming JSON must satisfy the `ProblemInput` schema; otherwise FastAPI returns 422.
2. Plan Generation – The Planner prompt contains the problem text, the required JSON keys, and a 512‑token cap. The response must parse under `PlanMessage.model_validate()`.
3. Static Prune – If `validate(plan) == False`, return `RunResult(status=STATIC_PRUNE_FAILED)`.
4. Iterative Loop (`iter = 0..max_iter−1`):
   1. Call the Coder; the response is parsed into a `CodeMessage`. On failure, retry once; otherwise return `RunResult(status=FAILED)`.
   2. The Profiler compiles with `g++ -O2 -std=c++17`. Resource caps: wall = `runtime_limit` + 250 ms, RSS ≤ 512 MB, stack 256 MB. Parse `/usr/bin/time -v`.
   3. The Analyst receives the `ProfileReport`, fits a log–log slope, classifies complexity and memory class, and returns a `VerdictMessage`.
   4. Check `verdict.efficient`. If true → success. Else route: `target_agent == CODER` → patch prompt, Coder reruns; else → Planner revise prompt, go to 4.1.
   5. Stop if `perf_gain < diminish_delta`, or `iter == max_iter−1`, or ≥ 2 agent crashes.
5. Assemble the `RunResult`, write to `logs_uri`, and return via the API.
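The control flow above can be sketched as follows. Agent calls are stubbed out and every name here is illustrative, not the actual src/swiftsolve implementation:

```python
from dataclasses import dataclass

# Stand-ins for the real message types; field names are illustrative,
# not the definitions in schemas/__init__.py.

@dataclass
class Verdict:
    efficient: bool
    target_agent: str = "CODER"   # or "PLANNER" for an algorithmic overhaul

@dataclass
class RunResult:
    status: str
    iterations: int = 0

def solve_loop(plan_passes_pruner: bool, verdicts: list, max_iter: int = 3) -> RunResult:
    """Skeleton of the Planner -> Pruner -> Coder -> Profiler -> Analyst loop.

    verdicts[i] simulates the Analyst's decision on iteration i; the real
    loop also stops early on diminishing perf_gain or >= 2 agent crashes.
    """
    if not plan_passes_pruner:                      # step 3: static prune gate
        return RunResult(status="STATIC_PRUNE_FAILED")
    for i in range(max_iter):                       # step 4: iterative loop
        verdict = verdicts[i]                       # Coder -> Profiler -> Analyst
        if verdict.efficient:                       # step 4.4: success check
            return RunResult(status="SUCCESS", iterations=i + 1)
        # otherwise route the patch: CODER -> local fix, PLANNER -> replan
    return RunResult(status="FAILED", iterations=max_iter)

# A run that converges on the second iteration:
result = solve_loop(True, [Verdict(False), Verdict(True), Verdict(True)])
```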
All six message types (PlanMessage, CodeMessage, ProfileReport, VerdictMessage, ProblemInput, RunResult) are fully defined in schemas/__init__.py and must not be duplicated elsewhere.
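As an illustration of the envelope style (assuming Pydantic v2; the field names below are invented for this example, and the authoritative definitions live in schemas/__init__.py):

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class PlanMessage(BaseModel):
    model_config = ConfigDict(extra="forbid")   # unknown keys -> ValidationError

    task_id: str
    schema_version: str = "1.0.0"
    algorithm: str                              # e.g. "prefix sums"
    complexity_target: str                      # e.g. "O(n)"

plan = PlanMessage.model_validate({
    "task_id": "cf_1700A",
    "algorithm": "prefix sums",
    "complexity_target": "O(n)",
})

rejected = False
try:
    PlanMessage.model_validate({"task_id": "x", "algorithm": "a",
                                "complexity_target": "O(1)", "oops": 1})
except ValidationError:
    rejected = True                             # extras are forbidden
```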
- Loop depth > 2 & `n ≥ 1e5` → reject.
- Sort‑in‑loop & `n ≥ 1e3` → reject.
- Unbounded recursion & `n ≥ 1e4` → reject.
- Configurable via `static_pruner/pruner.toml` hot‑reload.
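A minimal sketch of how two of these rules might look (illustrative regexes only; the real pruner combines regex with AST analysis and reads its thresholds from pruner.toml):

```python
import re

RULES = [
    # (pattern over the plan's pseudocode, minimum n at which to reject)
    (re.compile(r"(for .*\n\s+){2,}for ", re.M), 100_000),   # loop depth > 2
    (re.compile(r"for .*\n\s+.*\bsort\b", re.M), 1_000),     # sort inside a loop
]

def validate(pseudocode: str, n: int) -> bool:
    """Return False (reject the plan) if any heuristic fires at input bound n."""
    for pattern, threshold in RULES:
        if n >= threshold and pattern.search(pseudocode):
            return False
    return True

bad_plan = "for i in 1..n:\n    sort(a)\n"
validate(bad_plan, 10**6)   # False: sort-in-loop with n >= 1e3
validate(bad_plan, 100)     # True: n too small for the rule to apply
```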
- Namespace: non‑root, seccomp `docker/default`, no network.
- `resource.setrlimit`: `RLIMIT_AS` = 512 MB, `RLIMIT_STACK` = 256 MB, `RLIMIT_FSIZE` = 50 MB (stdout/stderr cap).
- Crash codes propagated to the controller as `RunStatus.SANDBOX_ERROR`.
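The caps can be applied in the sandboxed child via `preexec_fn`, roughly as follows (a sketch, POSIX‑only; the actual sandbox/run_in_sandbox.py may differ, and `/bin/echo` stands in for the compiled binary):

```python
import resource
import subprocess

MB = 1024 ** 2

def set_limits():
    """Runs in the child process just before exec."""
    resource.setrlimit(resource.RLIMIT_AS, (512 * MB, 512 * MB))      # address space
    resource.setrlimit(resource.RLIMIT_STACK, (256 * MB, 256 * MB))   # stack
    resource.setrlimit(resource.RLIMIT_FSIZE, (50 * MB, 50 * MB))     # output file cap

proc = subprocess.run(
    ["/bin/echo", "ok"],       # placeholder for the compiled ./a.out
    capture_output=True, text=True,
    timeout=2.25,              # e.g. runtime_limit (2 s) + 250 ms slack
    preexec_fn=set_limits,
)
```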
| Metric | Definition |
|---|---|
| pass@k | ≥ 1 of top‑k programs passes official unit tests |
| eff@k_runtime | ≥ 1 of top‑k passes time limit |
| eff@k_memory | ≥ 1 of top‑k stays < memory limit |
| TLE/MLE rate | % of executions exceeding runtime or memory cap |
| Iteration count | Mean #loops until efficient==true |
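pass@k (and eff@k_* with "passes" replaced by "meets the limit") can be computed with the standard unbiased estimator over n sampled programs of which c succeed; this assumes sampling‑based evaluation rather than a fixed ranking:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(>= 1 of k samples passes), given n samples, c passing."""
    if n - c < k:
        return 1.0                      # every size-k draw must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

pass_at_k(10, 3, 1)   # ≈ 0.3   (3 of 10 generations pass; draw one)
pass_at_k(10, 3, 5)   # ≈ 0.917
```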
- BigO(Bench) – 50 tasks across O(1)–O(n²).
- Codeforces Div‑2 – 25 tasks (800–1800 rating).
- Stored under `datasets/<source>/task_<id>.json` following `task_format.py`.
No user code is stored beyond telemetry; LLM prompts are cached encrypted at rest (AES‑GCM) if CACHE_ENCRYPT_KEY is set.
- `/solve` – POST – body `ProblemInput`, response `RunResult`.
- `/healthz` – GET – returns 200 + git hash + schema version.
- Python 3.11.13 (orchestrator)
- C++ 17 (generated code)
- Bash + GNU coreutils (sandbox scripts)
| Domain | Package | Version | Notes |
|---|---|---|---|
| API | FastAPI, Uvicorn[standard] | 0.116.1 | ASGI + Hot reload |
| LLM SDKs | openai, anthropic | 1.95.1 · 0.57.1 | Model calls |
| Schema | Pydantic v2.x + pydantic‑settings | 2.9.2 | Validation + env config |
| Testing | pytest, pytest‑asyncio | 8.1.0 · 0.23.5 | Unit & async tests |
| Lint | Ruff, Mypy, Pre‑commit | latest | CI static analysis |
| Logging | Rich, loguru (optional) | 13.7.1 | Colour logs + tracebacks |
| Data | pandas, plotnine, orjson | pinned | Evaluation & plotting |
- Docker 24.x ➜ Ubuntu 20.04 base image.
- Terraform 1.8.x ➜ GKE Autopilot cluster.
- GitHub Actions CI ➜ 3.11/3.12 matrix, push & PR.
| Variable | Purpose |
|---|---|
| `OPENAI_API_KEY` | GPT‑4.1 access |
| `ANTHROPIC_API_KEY` | Claude 4 Opus access |
| `LOG_LEVEL` | Default `INFO`, override to `DEBUG` |
| `CACHE_ENCRYPT_KEY` | 32‑byte key for encrypted prompt cache |
- Exact pins (`==`) for prod deps in `requirements.txt`.
- Dev‑only tools (`ruff`, `mypy`) in `requirements-dev.txt`.
- `pyproject.toml` sets `requires-python == 3.11.13`.
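For example, the prod pin file might look like this (an illustrative fragment using the versions from the dependency table above):

```text
# requirements.txt — exact (==) pins for production
fastapi==0.116.1
pydantic==2.9.2
openai==1.95.1
anthropic==0.57.1
```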
src/swiftsolve/
├── main.py # FastAPI + CLI entry
├── api/routes.py # /solve, /healthz
├── controller/
│ ├── solve_loop.py # Core orchestration
│ └── router.py # FastAPI wrapper (planned)
├── agents/
│ ├── base.py # Agent ABC + retry/caching hooks
│ ├── planner.py # Claude client, prompt templates
│ ├── coder.py # GPT‑4.1 client, code‑json extraction
│ ├── profiler.py # Sandbox wrapper + telemetry parse
│ └── analyst.py # Complexity fit + patch routing
├── static_pruner/pruner.py # Regex + AST heuristics
├── sandbox/
│ ├── run_in_sandbox.py # g++ compile & exec with limits
│ └── docker_utils.py # image build & push (WIP)
├── schemas/__init__.py # Unified Pydantic v2 models
├── datasets/
│ ├── parse_bigobench.py # HTML/JSON → task format
│ ├── parse_codeforces.py # Scrape + IO normalisation
│ └── task_format.py # Shared spec & validation
├── evaluation/
│ ├── metrics.py # pass@k, eff@k_* calculators
│ └── stats.py # DataFrame aggregation & plots
├── utils/
│ ├── config.py # pydantic‑settings Settings()
│ └── logger.py # Colour + rotating logs
└── tests/
├── test_agents.py # Planner/Coder mocks
├── test_pruner.py # Heuristic edge cases
├── test_schema.py # JSON round‑trip & forbid extras
└── test_sandbox.py # Compile & run smoke test
Every file must include docstring headers explaining purpose, inputs, and side‑effects.
The following hierarchy represents all work items required for SwiftSolve v1.0. Completed items are prefixed ✔.
- A1 Environment Setup
- A1.1 Create pyenv 3.11.13 and poetry project.
- A1.2 Add `requirements.txt` & `requirements-dev.txt`.
- A1.3 Configure pre‑commit (ruff, black, mypy, isort).
- A2 Schema Layer
- A2.1 Design envelope & message enums.
- ✔ A2.2 Implement `schemas/__init__.py` with Pydantic v2.
- A2.3 Add `test_schema.py` (field presence, forbid extras).
- A3 Logging & Config
- ✔ A3.1 `utils.logger` colour + rotating handler.
- ✔ A3.2 `utils.config` Settings singleton via pydantic‑settings.
- B1 Planner Agent
- B1.1 Prompt template design (JSON‑only output).
- B1.2 Claude client wrapper with retry & cache (Sqlite + orjson).
- B1.3 Unit tests: valid JSON, fallback plan.
- B2 Static Pruner
- ✔ B2.1 Regex + AST rule implementation.
- B2.2 `pruner.toml` external rule config.
- B2.3 Benchmark false‑positive rate on 100 plans.
- B3 Coder Agent
- B3.1 JSON‑only code prompt; ensure includes & I/O.
- B3.2 Escape‑sequence cleaning; compile smoke test.
- B3.3 Inject optional chrono & memory hooks on `debug` flag.
- B4 Solve Loop v0
- B4.1 Integrate Planner → Pruner → Coder chain.
- B4.2 FastAPI `/solve` returns stub `RunResult`.
- C1 Sandbox Runtime
- C1.1 Dockerfile (Ubuntu 20.04, g++‑10, time, gprof).
- C1.2 `run_in_sandbox.py` compile → run → time → RSS.
- C1.3 `resource.setrlimit` caps; stack 256 MB.
- C2 Profiler Agent
- C2.1 Generate logarithmic input scales.
- C2.2 Parse `/usr/bin/time -v` (regex).
- C2.3 Produce `ProfileReport`.
- C3 Complexity Analyst
- C3.1 Heuristic slope fit (log‑log).
- C3.2 GPT‑4.1 fallback for ambiguous curves.
- C3.3 Patch routing logic (`TargetAgent`).
- C4 Termination Logic
- C4.1 Implement `perf_gain` check vs `diminish_delta`.
- C4.2 Loop abort on 2 agent failures.
- D1 Dataset Parsers
- D1.1 HTML scrape BigO(Bench) tasks.
- D1.2 REST scrape Codeforces tasks.
- D1.3 Validate against `task_format.py`.
- D2 Batch Runner
- D2.1 CLI flags: `--benchmark`, `--seeds`, `--replans`.
- D2.2 Multiprocess pool, progress bar (tqdm).
- D2.3 Store artefacts under `results/<task>/seed_<s>/`.
- D3 Metrics & Plots
- D3.1 Calc pass@k, eff@k_*, TLE/MLE.
- D3.2 Plot runtime curves (plotnine) per class.
- D3.3 Generate Markdown + CSV summary.
- E1 Infrastructure as Code
- E1.1 Terraform modules for GKE Autopilot.
- E1.2 GCS bucket + Cloud NAT egress.
- E2 Scaling Run
- E2.1 Run 12 000 cycles; monitor cost.
- E2.2 Download results + run evaluation.
- E3 Paper & Artefact
- E3.1 Write methodology, experiments, ablations.
- E3.2 Insert tables: eff@1, iteration counts.
- E3.3 Prepare artefact for NeurIPS reproducibility checklist.
Note: for testing, we will use gpt-4.1 and claude-4-opus for budgeting reasons.