Merged
8 changes: 7 additions & 1 deletion .github/workflows/validate-stack.yml
@@ -26,14 +26,15 @@ jobs:
        run: python scripts/validate_stack.py

      - name: Python syntax check
        run: python -m py_compile scripts/validate_stack.py scripts/aoa-host-facts
        run: python -m py_compile scripts/validate_stack.py scripts/aoa-host-facts scripts/aoa-local-ai-trials scripts/aoa-langgraph-pilot scripts/aoa-w5-pilot scripts/aoa-w6-pilot scripts/aoa-llamacpp-pilot

      - name: Shellcheck scripts
        run: |
          shellcheck \
            scripts/aoa-lib.sh \
            scripts/aoa-doctor \
            scripts/aoa-install-layout \
            scripts/aoa-sync-federation-surfaces \
            scripts/aoa-sync-configs \
            scripts/aoa-bootstrap-configs \
            scripts/aoa-check-layout \
@@ -131,6 +132,11 @@ jobs:
          export AOA_EXTRA_COMPOSE_FILES="compose/tuning/ollama.cpu.yml"
          scripts/aoa-render-config --profile core >/dev/null

          printf 'GGUFTEST' > "$RUNNER_TEMP/qwen3.5-9b.gguf"
          export AOA_LLAMACPP_MODEL_HOST_PATH="$RUNNER_TEMP/qwen3.5-9b.gguf"
          export AOA_EXTRA_COMPOSE_FILES="compose/modules/32-llamacpp-inference.yml,compose/modules/44-llamacpp-agent-sidecar.yml"
          scripts/aoa-render-config --preset intel-full >/dev/null

      - name: Capture host-facts artifacts
        run: |
          mkdir -p "$RUNNER_TEMP/host-facts"
56 changes: 31 additions & 25 deletions README.md
@@ -52,31 +52,33 @@ This repository should not absorb:
7. Read [docs/PROFILE_RECIPES](docs/PROFILE_RECIPES.md).
8. Read [docs/RENDER_TRUTH](docs/RENDER_TRUTH.md).
9. Read [docs/RUNTIME_BENCH_POLICY](docs/RUNTIME_BENCH_POLICY.md).
10. Read [docs/INTERNAL_PROBES](docs/INTERNAL_PROBES.md).
11. Read [docs/PATHS](docs/PATHS.md).
12. Read [docs/WINDOWS_BRIDGE](docs/WINDOWS_BRIDGE.md).
13. Read [docs/WINDOWS_SETUP](docs/WINDOWS_SETUP.md).
14. Read [docs/WINDOWS_PERFORMANCE](docs/WINDOWS_PERFORMANCE.md).
15. Read [docs/STORAGE_LAYOUT](docs/STORAGE_LAYOUT.md).
16. Read [docs/REFERENCE_PLATFORM](docs/REFERENCE_PLATFORM.md).
17. Read [docs/REFERENCE_PLATFORM_SPEC](docs/REFERENCE_PLATFORM_SPEC.md).
18. Read [docs/MACHINE_FIT_POLICY](docs/MACHINE_FIT_POLICY.md).
19. Read [docs/PLATFORM_ADAPTATION_POLICY](docs/PLATFORM_ADAPTATION_POLICY.md).
20. Read [docs/BRANCH_POLICY](docs/BRANCH_POLICY.md).
21. Read [docs/MEMO_RUNTIME_SEAM](docs/MEMO_RUNTIME_SEAM.md).
22. Read [docs/EVAL_RUNTIME_SEAM](docs/EVAL_RUNTIME_SEAM.md).
23. Read [docs/PLAYBOOK_RUNTIME_SEAM](docs/PLAYBOOK_RUNTIME_SEAM.md).
24. Read [docs/MODEL_PROFILES](docs/MODEL_PROFILES.md).
25. Read [docs/CONTEXT_BUDGET_POLICY](docs/CONTEXT_BUDGET_POLICY.md).
26. Read [docs/RECURRENCE_RUNTIME_POLICY](docs/RECURRENCE_RUNTIME_POLICY.md).
27. Read [docs/DEPLOYMENT](docs/DEPLOYMENT.md).
28. Read [docs/FIRST_RUN](docs/FIRST_RUN.md).
29. Read [docs/DOCTOR](docs/DOCTOR.md).
30. Read [docs/SECRETS_BOOTSTRAP](docs/SECRETS_BOOTSTRAP.md).
31. Read [docs/LIFECYCLE](docs/LIFECYCLE.md).
32. Read [docs/RUNBOOK](docs/RUNBOOK.md).
33. Read [docs/SECURITY](docs/SECURITY.md).
34. Read [docs/MIGRATION_FROM_OLD](docs/MIGRATION_FROM_OLD.md).
10. Read [docs/LLAMACPP_PILOT](docs/LLAMACPP_PILOT.md).
11. Read [docs/LOCAL_AI_TRIALS](docs/LOCAL_AI_TRIALS.md).
12. Read [docs/INTERNAL_PROBES](docs/INTERNAL_PROBES.md).
13. Read [docs/PATHS](docs/PATHS.md).
14. Read [docs/WINDOWS_BRIDGE](docs/WINDOWS_BRIDGE.md).
15. Read [docs/WINDOWS_SETUP](docs/WINDOWS_SETUP.md).
16. Read [docs/WINDOWS_PERFORMANCE](docs/WINDOWS_PERFORMANCE.md).
17. Read [docs/STORAGE_LAYOUT](docs/STORAGE_LAYOUT.md).
18. Read [docs/REFERENCE_PLATFORM](docs/REFERENCE_PLATFORM.md).
19. Read [docs/REFERENCE_PLATFORM_SPEC](docs/REFERENCE_PLATFORM_SPEC.md).
20. Read [docs/MACHINE_FIT_POLICY](docs/MACHINE_FIT_POLICY.md).
21. Read [docs/PLATFORM_ADAPTATION_POLICY](docs/PLATFORM_ADAPTATION_POLICY.md).
22. Read [docs/BRANCH_POLICY](docs/BRANCH_POLICY.md).
23. Read [docs/MEMO_RUNTIME_SEAM](docs/MEMO_RUNTIME_SEAM.md).
24. Read [docs/EVAL_RUNTIME_SEAM](docs/EVAL_RUNTIME_SEAM.md).
25. Read [docs/PLAYBOOK_RUNTIME_SEAM](docs/PLAYBOOK_RUNTIME_SEAM.md).
26. Read [docs/MODEL_PROFILES](docs/MODEL_PROFILES.md).
27. Read [docs/CONTEXT_BUDGET_POLICY](docs/CONTEXT_BUDGET_POLICY.md).
28. Read [docs/RECURRENCE_RUNTIME_POLICY](docs/RECURRENCE_RUNTIME_POLICY.md).
29. Read [docs/DEPLOYMENT](docs/DEPLOYMENT.md).
30. Read [docs/FIRST_RUN](docs/FIRST_RUN.md).
31. Read [docs/DOCTOR](docs/DOCTOR.md).
32. Read [docs/SECRETS_BOOTSTRAP](docs/SECRETS_BOOTSTRAP.md).
33. Read [docs/LIFECYCLE](docs/LIFECYCLE.md).
34. Read [docs/RUNBOOK](docs/RUNBOOK.md).
35. Read [docs/SECURITY](docs/SECURITY.md).
36. Read [docs/MIGRATION_FROM_OLD](docs/MIGRATION_FROM_OLD.md).

For the shortest next route by intent:
- if you need the ecosystem center, layer map, or federation rules, go to [`Agents-of-Abyss`](https://github.com/8Dionysus/Agents-of-Abyss)
@@ -89,6 +91,8 @@ For the shortest next route by intent:
- if you need playbook meaning, activation doctrine, or authored execution bundles, go to [`aoa-playbooks`](https://github.com/8Dionysus/aoa-playbooks)
- if you need the Windows host and WSL bridge workflow, read [docs/WINDOWS_BRIDGE](docs/WINDOWS_BRIDGE.md), [docs/WINDOWS_SETUP](docs/WINDOWS_SETUP.md), and [docs/WINDOWS_PERFORMANCE](docs/WINDOWS_PERFORMANCE.md)
- if you need runtime benchmark ownership, storage, and manifest rules, read [docs/RUNTIME_BENCH_POLICY](docs/RUNTIME_BENCH_POLICY.md)
- if you need the bounded llama.cpp A/B runtime pilot next to the validated Ollama path, read [docs/LLAMACPP_PILOT](docs/LLAMACPP_PILOT.md)
- if you need bounded local-model trial contracts, W4 supervised edits, or the promoted W5/W6 local-worker path, read [docs/LOCAL_AI_TRIALS](docs/LOCAL_AI_TRIALS.md)
- if you need normative host posture or machine-readable host-facts capture, read [docs/REFERENCE_PLATFORM](docs/REFERENCE_PLATFORM.md) and [docs/REFERENCE_PLATFORM_SPEC](docs/REFERENCE_PLATFORM_SPEC.md)
- if you need to tune the runtime to the current machine, confirm driver freshness, or decide which preset the host should prefer, read [docs/MACHINE_FIT_POLICY](docs/MACHINE_FIT_POLICY.md)
- if you need a compact record of platform-specific quirks, adaptations, and portability notes, read [docs/PLATFORM_ADAPTATION_POLICY](docs/PLATFORM_ADAPTATION_POLICY.md)
@@ -145,9 +149,11 @@ The stack is organized around explicit compose modules rather than one swollen f
- `20-orchestration.yml`
- `30-local-inference.yml`
- `31-intel-inference.yml`
- `32-llamacpp-inference.yml`
- `40-llm-gateway.yml`
- `41-agent-api.yml`
- `42-agent-api-intel.yml`
- `44-llamacpp-agent-sidecar.yml`
- `50-speech.yml`
- `51-browser-tools.yml`
- `60-monitoring.yml`
11 changes: 11 additions & 0 deletions compose/README.md
@@ -8,9 +8,11 @@ The new stack uses small compose modules, named profiles, and named presets.
- `modules/20-orchestration.yml`
- `modules/30-local-inference.yml`
- `modules/31-intel-inference.yml`
- `modules/32-llamacpp-inference.yml`
- `modules/40-llm-gateway.yml`
- `modules/41-agent-api.yml`
- `modules/42-agent-api-intel.yml`
- `modules/44-llamacpp-agent-sidecar.yml`
- `modules/50-speech.yml`
- `modules/51-browser-tools.yml`
- `modules/60-monitoring.yml`
@@ -38,6 +40,15 @@ A profile is only a list of module filenames in activation order.

A preset is a list of profile names in activation order.
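
As an illustrative sketch (the on-disk format is owned by this repo's render tooling, so treat the layout below as an assumption; only the module filenames and the `core`/`intel-full` names appear elsewhere in this change):

```bash
# Hypothetical sketch only — consult scripts/aoa-render-config for the real format.
# A profile: module filenames in activation order, e.g.
#   modules/20-orchestration.yml
#   modules/30-local-inference.yml
# A preset: profile names in activation order, e.g.
#   core
```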

## Optional pilot modules

`32-llamacpp-inference.yml` and `44-llamacpp-agent-sidecar.yml` are not part of the default profiles or presets.

They exist for the bounded `llama.cpp` sidecar pilot and are typically activated through:

- `scripts/aoa-llamacpp-pilot`
- or `AOA_EXTRA_COMPOSE_FILES` when you intentionally want the sidecar path
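
For example, the sidecar path can be rendered manually with the same environment contract the CI smoke test uses (preset name and script path are taken from this change):

```bash
# Activate the llama.cpp pilot modules explicitly via the extra-compose seam.
export AOA_EXTRA_COMPOSE_FILES="compose/modules/32-llamacpp-inference.yml,compose/modules/44-llamacpp-agent-sidecar.yml"
scripts/aoa-render-config --preset intel-full >/dev/null
```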

## Rule

New capability should arrive as:
33 changes: 33 additions & 0 deletions compose/modules/32-llamacpp-inference.yml
@@ -0,0 +1,33 @@
services:
  llama-cpp:
    image: "${AOA_LLAMACPP_IMAGE:-ghcr.io/ggml-org/llama.cpp:server-openvino}"
    platform: linux/amd64
    container_name: llama-cpp
    restart: unless-stopped
    cpus: "${AOA_LLAMACPP_CPUS:-4.0}"
    mem_limit: "${AOA_LLAMACPP_MEM_LIMIT:-12g}"
    mem_reservation: "${AOA_LLAMACPP_MEM_RESERVATION:-8g}"
    environment:
      LLAMA_ARG_MODEL: /models/qwen3.5-9b.gguf
      LLAMA_ARG_ALIAS: "${AOA_LLAMACPP_MODEL_ALIAS:-qwen3.5:9b}"
      LLAMA_ARG_HOST: 0.0.0.0
      LLAMA_ARG_PORT: "8080"
      LLAMA_ARG_CTX_SIZE: "${AOA_LLAMACPP_CTX_SIZE:-4096}"
      LLAMA_ARG_THREADS: "${AOA_LLAMACPP_THREADS:-4}"
      LLAMA_ARG_THREADS_BATCH: "${AOA_LLAMACPP_THREADS_BATCH:-4}"
      LLAMA_ARG_THREADS_HTTP: "${AOA_LLAMACPP_THREADS_HTTP:-2}"
      LLAMA_ARG_PARALLEL: "${AOA_LLAMACPP_PARALLEL:-1}"
      LLAMA_ARG_BATCH_SIZE: "${AOA_LLAMACPP_BATCH_SIZE:-512}"
      LLAMA_ARG_UBATCH_SIZE: "${AOA_LLAMACPP_UBATCH_SIZE:-128}"
      LLAMA_ARG_N_GPU_LAYERS: "${AOA_LLAMACPP_N_GPU_LAYERS:-0}"
      LLAMA_ARG_DEVICE: "${AOA_LLAMACPP_DEVICE:-none}"
      LLAMA_ARG_ENDPOINT_METRICS: "${AOA_LLAMACPP_ENDPOINT_METRICS:-1}"
      LLAMA_ARG_JINJA: "${AOA_LLAMACPP_JINJA:-1}"
      LLAMA_ARG_REASONING: "${AOA_LLAMACPP_REASONING:-off}"
      LLAMA_ARG_THINK: "${AOA_LLAMACPP_THINK:-none}"
      LLAMA_ARG_NO_OP_OFFLOAD: "${AOA_LLAMACPP_NO_OP_OFFLOAD:-1}"
      LLAMA_ARG_NO_WARMUP: "${AOA_LLAMACPP_NO_WARMUP:-1}"
    volumes:
      - "${AOA_LLAMACPP_MODEL_HOST_PATH:-/srv/abyss-stack/Logs/llamacpp/missing-model.gguf}:/models/qwen3.5-9b.gguf:ro,Z"
    ports:
      - "127.0.0.1:${AOA_LLAMACPP_HOST_PORT:-11435}:8080"
32 changes: 32 additions & 0 deletions compose/modules/44-llamacpp-agent-sidecar.yml
@@ -0,0 +1,32 @@
services:
  langchain-api-llamacpp:
    build: "${AOA_STACK_ROOT:-/srv/abyss-stack}/Services/langchain-api"
    container_name: langchain-api-llamacpp
    env_file:
      - "${AOA_STACK_ROOT:-/srv/abyss-stack}/Secrets/Configs/langchain-api.env"
    environment:
      LC_BASE_URL: http://llama-cpp:8080/v1
      LC_API_KEY: EMPTY
      LC_MODEL: "${AOA_LLAMACPP_MODEL_ALIAS:-qwen3.5:9b}"
      LC_TIMEOUT_S: 300
      LC_OLLAMA_NATIVE_CHAT: "false"
      LC_OPENAI_LITERAL_COMPLETIONS: "true"
      AOA_RETURN_ENABLED: "${AOA_RETURN_ENABLED:-true}"
      AOA_RETURN_POLICY_PATH: "${AOA_RETURN_POLICY_PATH:-/app/config/return-policy.yaml}"
      AOA_RETURN_LOG_ROOT: "${AOA_RETURN_LOG_ROOT:-/app/logs/returns-llamacpp}"
      AOA_FEDERATED_RUN_ENABLED: "false"
      EMBEDDINGS_PROVIDER: ovms
      OVMS_EMBEDDINGS_URL: http://host.containers.internal:8200/v3/embeddings
      OVMS_EMBEDDINGS_MODEL: qwen3-embed-0.6b-int8-ov
    volumes:
      - "${AOA_STACK_ROOT:-/srv/abyss-stack}/Configs/agent-api/return-policy.yaml:/app/config/return-policy.yaml:ro,Z"
      - "${AOA_STACK_ROOT:-/srv/abyss-stack}/Logs/returns-llamacpp:/app/logs/returns-llamacpp:Z"
    ports:
      - "127.0.0.1:${AOA_LLAMACPP_LANGCHAIN_HOST_PORT:-5403}:5401"
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:5401/health', timeout=2).read()"]
      interval: 5s
      timeout: 3s
      retries: 12
      start_period: 5s
    restart: unless-stopped
120 changes: 119 additions & 1 deletion config-templates/Services/langchain-api/app/main.py
@@ -1,5 +1,6 @@
import json
import os
import re
import urllib.error
import urllib.request
from pathlib import Path
@@ -18,6 +19,9 @@

app = FastAPI()

THINK_TAG_PREFIX_RE = re.compile(r"^\s*<think>.*?</think>\s*", re.DOTALL)
LITERAL_REPLY_PROMPT_RE = re.compile(r"^Reply exactly with:\s*(.+?)\s*$", re.DOTALL)

BASE_URL = os.getenv("LC_BASE_URL", "http://ollama:11434/v1").rstrip("/")
API_KEY = os.getenv("LC_API_KEY", "EMPTY")
MODEL = os.getenv("LC_MODEL", "qwen3.5:9b")
@@ -29,6 +33,10 @@
    "yes",
    "on",
}
OPENAI_LITERAL_COMPLETIONS = os.getenv(
    "LC_OPENAI_LITERAL_COMPLETIONS",
    "false",
).strip().lower() in {"1", "true", "yes", "on"}
OLLAMA_NATIVE_CHAT_URL = os.getenv(
    "LC_OLLAMA_NATIVE_CHAT_URL",
    "http://ollama:11434/api/chat",
@@ -209,6 +217,18 @@ def _http_post_json(
    return parsed


def _http_auth_headers() -> dict[str, str] | None:
    if not API_KEY:
        return None
    return {"Authorization": f"Bearer {API_KEY}"}


def _llamacpp_completion_url() -> str:
    if BASE_URL.endswith("/v1"):
        return f"{BASE_URL[:-3]}/completion"
    return f"{BASE_URL}/completion"


def _route_api_post(path: str, payload: dict[str, Any]) -> dict[str, Any]:
    url = f"{ROUTE_API_BASE_URL}{path}"
    req = urllib.request.Request(
@@ -368,13 +388,106 @@ def _ollama_chat(req: RunReq) -> dict[str, Any]:
    return {"ok": True, "backend": "ollama-native", "model": MODEL, "answer": content}


def _flatten_response_content(content: Any) -> str:
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        chunks: list[str] = []
        for item in content:
            if isinstance(item, str):
                chunks.append(item)
                continue
            if isinstance(item, dict) and item.get("type") == "text" and isinstance(item.get("text"), str):
                chunks.append(item["text"])
        return "".join(chunks)
    return ""


def _normalize_answer_text(content: Any) -> str:
    text = _flatten_response_content(content).strip()
    while text:
        updated = THINK_TAG_PREFIX_RE.sub("", text, count=1).strip()
        if updated == text:
            break
        text = updated
    return text


def _literal_reply_target(req: RunReq) -> str | None:
    if not OPENAI_LITERAL_COMPLETIONS:
        return None
    if float(req.temperature) != 0.0:
        return None
    if int(req.max_tokens) > 16:
        return None
    match = LITERAL_REPLY_PROMPT_RE.fullmatch(req.user_text.strip())
    if not match:
        return None
    target = match.group(1).strip()
    if not target or len(target) > 160:
        return None
    return target


def _openai_completion(req: RunReq) -> dict[str, Any]:
    text = ""
    try:
        native_payload = {
            "model": MODEL,
            "prompt": req.user_text,
            "temperature": float(req.temperature),
            "n_predict": int(req.max_tokens),
        }
        native_data = _http_post_json(
            _llamacpp_completion_url(),
            native_payload,
            TIMEOUT,
            headers=_http_auth_headers(),
        )
        native_text = native_data.get("content")
        if isinstance(native_text, str):
            text = native_text
    except RuntimeError:
        text = ""

    if not text:
        payload = {
            "model": MODEL,
            "prompt": req.user_text,
            "temperature": float(req.temperature),
            "max_tokens": int(req.max_tokens),
        }
        data = _http_post_json(
            f"{BASE_URL}/completions",
            payload,
            TIMEOUT,
            headers=_http_auth_headers(),
        )
        choices = data.get("choices")
        if isinstance(choices, list) and choices:
            first = choices[0]
            if isinstance(first, dict):
                text = str(first.get("text") or "")
    if not isinstance(text, str) or not text:
        raise RuntimeError("unexpected_openai_completion_response: missing text")
    return {
        "ok": True,
        "backend": "langchain",
        "model": MODEL,
        "answer": _normalize_answer_text(text),
    }


def _invoke_run_backend(req: RunReq) -> dict[str, Any]:
    if OLLAMA_NATIVE_CHAT and ("litellm" in BASE_URL or "ollama" in BASE_URL):
        return _ollama_chat(req)

    if ChatOpenAI is None or HumanMessage is None:
        raise RuntimeError("langchain_openai dependencies are not installed")

    if _literal_reply_target(req) is not None:
        return _openai_completion(req)

    llm_kwargs: dict[str, Any] = {
        "model": MODEL,
        "base_url": BASE_URL,
@@ -402,7 +515,12 @@ def _invoke_run_backend(req: RunReq) -> dict[str, Any]:

    llm = ChatOpenAI(**llm_kwargs)
    resp = llm.invoke([HumanMessage(content=req.user_text)])
    return {"ok": True, "backend": "langchain", "model": MODEL, "answer": (resp.content or "")}
    return {
        "ok": True,
        "backend": "langchain",
        "model": MODEL,
        "answer": _normalize_answer_text(resp.content),
    }


def _effective_profile_class(profile_class: PROFILE_CLASS | None) -> PROFILE_CLASS:
11 changes: 11 additions & 0 deletions docs/FIRST_RUN.md
@@ -149,6 +149,17 @@ scripts/aoa-local-ai-trials run-wave W0
That flow keeps machine-readable trial truth under `Logs/local-ai-trials/` and writes Markdown mirrors to `Dionysus/reports/local-ai-trials/`.
Use [LOCAL_AI_TRIALS](LOCAL_AI_TRIALS.md) for the full contract.

## Optional llama.cpp backend-parity pilot

If you want to compare a bounded `llama.cpp` sidecar against the current validated Ollama path without replacing the canonical runtime:

```bash
scripts/aoa-llamacpp-pilot run --preset intel-full
```

That pilot resolves the resident Ollama GGUF blob, starts `llama-cpp` on a separate host port, exposes a sidecar `langchain-api-llamacpp` on `127.0.0.1:5403`, and writes comparison artifacts under `${AOA_STACK_ROOT}/Logs/runtime-benchmarks/comparisons/`.
Use [LLAMACPP_PILOT](LLAMACPP_PILOT.md) for the full contract.
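
Once the pilot is up, a quick smoke check against the sidecar's default loopback mapping can look like this (the `/health` path mirrors the compose healthcheck in this change; adjust the port if `AOA_LLAMACPP_LANGCHAIN_HOST_PORT` is overridden):

```bash
# Probe the langchain-api-llamacpp sidecar on its default host port.
curl -s http://127.0.0.1:5403/health
```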

## Compose optional layers manually

### Agent runtime plus tools