Merged
8 changes: 7 additions & 1 deletion .github/workflows/validate-stack.yml
@@ -26,14 +26,15 @@ jobs:
        run: python scripts/validate_stack.py

      - name: Python syntax check
        run: python -m py_compile scripts/validate_stack.py scripts/aoa-host-facts
        run: python -m py_compile scripts/validate_stack.py scripts/aoa-host-facts scripts/aoa-local-ai-trials scripts/aoa-langgraph-pilot scripts/aoa-w5-pilot scripts/aoa-w6-pilot scripts/aoa-llamacpp-pilot

      - name: Shellcheck scripts
        run: |
          shellcheck \
            scripts/aoa-lib.sh \
            scripts/aoa-doctor \
            scripts/aoa-install-layout \
            scripts/aoa-sync-federation-surfaces \
            scripts/aoa-sync-configs \
            scripts/aoa-bootstrap-configs \
            scripts/aoa-check-layout \
@@ -131,6 +132,11 @@ jobs:
          export AOA_EXTRA_COMPOSE_FILES="compose/tuning/ollama.cpu.yml"
          scripts/aoa-render-config --profile core >/dev/null

          printf 'GGUFTEST' > "$RUNNER_TEMP/qwen3.5-9b.gguf"
          export AOA_LLAMACPP_MODEL_HOST_PATH="$RUNNER_TEMP/qwen3.5-9b.gguf"
          export AOA_EXTRA_COMPOSE_FILES="compose/modules/32-llamacpp-inference.yml,compose/modules/44-llamacpp-agent-sidecar.yml"
          scripts/aoa-render-config --preset intel-full >/dev/null

      - name: Capture host-facts artifacts
        run: |
          mkdir -p "$RUNNER_TEMP/host-facts"
56 changes: 31 additions & 25 deletions README.md
@@ -52,31 +52,33 @@ This repository should not absorb:
7. Read [docs/PROFILE_RECIPES](docs/PROFILE_RECIPES.md).
8. Read [docs/RENDER_TRUTH](docs/RENDER_TRUTH.md).
9. Read [docs/RUNTIME_BENCH_POLICY](docs/RUNTIME_BENCH_POLICY.md).
10. Read [docs/INTERNAL_PROBES](docs/INTERNAL_PROBES.md).
11. Read [docs/PATHS](docs/PATHS.md).
12. Read [docs/WINDOWS_BRIDGE](docs/WINDOWS_BRIDGE.md).
13. Read [docs/WINDOWS_SETUP](docs/WINDOWS_SETUP.md).
14. Read [docs/WINDOWS_PERFORMANCE](docs/WINDOWS_PERFORMANCE.md).
15. Read [docs/STORAGE_LAYOUT](docs/STORAGE_LAYOUT.md).
16. Read [docs/REFERENCE_PLATFORM](docs/REFERENCE_PLATFORM.md).
17. Read [docs/REFERENCE_PLATFORM_SPEC](docs/REFERENCE_PLATFORM_SPEC.md).
18. Read [docs/MACHINE_FIT_POLICY](docs/MACHINE_FIT_POLICY.md).
19. Read [docs/PLATFORM_ADAPTATION_POLICY](docs/PLATFORM_ADAPTATION_POLICY.md).
20. Read [docs/BRANCH_POLICY](docs/BRANCH_POLICY.md).
21. Read [docs/MEMO_RUNTIME_SEAM](docs/MEMO_RUNTIME_SEAM.md).
22. Read [docs/EVAL_RUNTIME_SEAM](docs/EVAL_RUNTIME_SEAM.md).
23. Read [docs/PLAYBOOK_RUNTIME_SEAM](docs/PLAYBOOK_RUNTIME_SEAM.md).
24. Read [docs/MODEL_PROFILES](docs/MODEL_PROFILES.md).
25. Read [docs/CONTEXT_BUDGET_POLICY](docs/CONTEXT_BUDGET_POLICY.md).
26. Read [docs/RECURRENCE_RUNTIME_POLICY](docs/RECURRENCE_RUNTIME_POLICY.md).
27. Read [docs/DEPLOYMENT](docs/DEPLOYMENT.md).
28. Read [docs/FIRST_RUN](docs/FIRST_RUN.md).
29. Read [docs/DOCTOR](docs/DOCTOR.md).
30. Read [docs/SECRETS_BOOTSTRAP](docs/SECRETS_BOOTSTRAP.md).
31. Read [docs/LIFECYCLE](docs/LIFECYCLE.md).
32. Read [docs/RUNBOOK](docs/RUNBOOK.md).
33. Read [docs/SECURITY](docs/SECURITY.md).
34. Read [docs/MIGRATION_FROM_OLD](docs/MIGRATION_FROM_OLD.md).
10. Read [docs/LLAMACPP_PILOT](docs/LLAMACPP_PILOT.md).
11. Read [docs/LOCAL_AI_TRIALS](docs/LOCAL_AI_TRIALS.md).
12. Read [docs/INTERNAL_PROBES](docs/INTERNAL_PROBES.md).
13. Read [docs/PATHS](docs/PATHS.md).
14. Read [docs/WINDOWS_BRIDGE](docs/WINDOWS_BRIDGE.md).
15. Read [docs/WINDOWS_SETUP](docs/WINDOWS_SETUP.md).
16. Read [docs/WINDOWS_PERFORMANCE](docs/WINDOWS_PERFORMANCE.md).
17. Read [docs/STORAGE_LAYOUT](docs/STORAGE_LAYOUT.md).
18. Read [docs/REFERENCE_PLATFORM](docs/REFERENCE_PLATFORM.md).
19. Read [docs/REFERENCE_PLATFORM_SPEC](docs/REFERENCE_PLATFORM_SPEC.md).
20. Read [docs/MACHINE_FIT_POLICY](docs/MACHINE_FIT_POLICY.md).
21. Read [docs/PLATFORM_ADAPTATION_POLICY](docs/PLATFORM_ADAPTATION_POLICY.md).
22. Read [docs/BRANCH_POLICY](docs/BRANCH_POLICY.md).
23. Read [docs/MEMO_RUNTIME_SEAM](docs/MEMO_RUNTIME_SEAM.md).
24. Read [docs/EVAL_RUNTIME_SEAM](docs/EVAL_RUNTIME_SEAM.md).
25. Read [docs/PLAYBOOK_RUNTIME_SEAM](docs/PLAYBOOK_RUNTIME_SEAM.md).
26. Read [docs/MODEL_PROFILES](docs/MODEL_PROFILES.md).
27. Read [docs/CONTEXT_BUDGET_POLICY](docs/CONTEXT_BUDGET_POLICY.md).
28. Read [docs/RECURRENCE_RUNTIME_POLICY](docs/RECURRENCE_RUNTIME_POLICY.md).
29. Read [docs/DEPLOYMENT](docs/DEPLOYMENT.md).
30. Read [docs/FIRST_RUN](docs/FIRST_RUN.md).
31. Read [docs/DOCTOR](docs/DOCTOR.md).
32. Read [docs/SECRETS_BOOTSTRAP](docs/SECRETS_BOOTSTRAP.md).
33. Read [docs/LIFECYCLE](docs/LIFECYCLE.md).
34. Read [docs/RUNBOOK](docs/RUNBOOK.md).
35. Read [docs/SECURITY](docs/SECURITY.md).
36. Read [docs/MIGRATION_FROM_OLD](docs/MIGRATION_FROM_OLD.md).

For the shortest next route by intent:
- if you need the ecosystem center, layer map, or federation rules, go to [`Agents-of-Abyss`](https://github.com/8Dionysus/Agents-of-Abyss)
@@ -89,6 +91,8 @@ For the shortest next route by intent:
- if you need playbook meaning, activation doctrine, or authored execution bundles, go to [`aoa-playbooks`](https://github.com/8Dionysus/aoa-playbooks)
- if you need the Windows host and WSL bridge workflow, read [docs/WINDOWS_BRIDGE](docs/WINDOWS_BRIDGE.md), [docs/WINDOWS_SETUP](docs/WINDOWS_SETUP.md), and [docs/WINDOWS_PERFORMANCE](docs/WINDOWS_PERFORMANCE.md)
- if you need runtime benchmark ownership, storage, and manifest rules, read [docs/RUNTIME_BENCH_POLICY](docs/RUNTIME_BENCH_POLICY.md)
- if you need the bounded llama.cpp A/B runtime pilot next to the validated Ollama path, read [docs/LLAMACPP_PILOT](docs/LLAMACPP_PILOT.md)
- if you need bounded local-model trial contracts, W4 supervised edits, or the promoted W5/W6 local-worker path, read [docs/LOCAL_AI_TRIALS](docs/LOCAL_AI_TRIALS.md)
- if you need normative host posture or machine-readable host-facts capture, read [docs/REFERENCE_PLATFORM](docs/REFERENCE_PLATFORM.md) and [docs/REFERENCE_PLATFORM_SPEC](docs/REFERENCE_PLATFORM_SPEC.md)
- if you need to tune the runtime to the current machine, confirm driver freshness, or decide which preset the host should prefer, read [docs/MACHINE_FIT_POLICY](docs/MACHINE_FIT_POLICY.md)
- if you need a compact record of platform-specific quirks, adaptations, and portability notes, read [docs/PLATFORM_ADAPTATION_POLICY](docs/PLATFORM_ADAPTATION_POLICY.md)
@@ -145,9 +149,11 @@ The stack is organized around explicit compose modules rather than one swollen f
- `20-orchestration.yml`
- `30-local-inference.yml`
- `31-intel-inference.yml`
- `32-llamacpp-inference.yml`
- `40-llm-gateway.yml`
- `41-agent-api.yml`
- `42-agent-api-intel.yml`
- `44-llamacpp-agent-sidecar.yml`
- `50-speech.yml`
- `51-browser-tools.yml`
- `60-monitoring.yml`
11 changes: 11 additions & 0 deletions compose/README.md
@@ -8,9 +8,11 @@ The new stack uses small compose modules, named profiles, and named presets.
- `modules/20-orchestration.yml`
- `modules/30-local-inference.yml`
- `modules/31-intel-inference.yml`
- `modules/32-llamacpp-inference.yml`
- `modules/40-llm-gateway.yml`
- `modules/41-agent-api.yml`
- `modules/42-agent-api-intel.yml`
- `modules/44-llamacpp-agent-sidecar.yml`
- `modules/50-speech.yml`
- `modules/51-browser-tools.yml`
- `modules/60-monitoring.yml`
@@ -38,6 +40,15 @@ A profile is only a list of module filenames in activation order.

A preset is a list of profile names in activation order.
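
As an illustrative sketch (the on-disk format is owned by this repo's render tooling, so treat the layout below as an assumption; only the module filenames and the `core`/`intel-full` names appear elsewhere in this change):

```bash
# Hypothetical sketch only — consult scripts/aoa-render-config for the real format.
# A profile: module filenames in activation order, e.g.
#   modules/20-orchestration.yml
#   modules/30-local-inference.yml
# A preset: profile names in activation order, e.g.
#   core
```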

## Optional pilot modules

`32-llamacpp-inference.yml` and `44-llamacpp-agent-sidecar.yml` are not part of the default profiles or presets.

They exist for the bounded `llama.cpp` sidecar pilot and are typically activated through:

- `scripts/aoa-llamacpp-pilot`
- or `AOA_EXTRA_COMPOSE_FILES` when you intentionally want the sidecar path
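
For example, the sidecar path can be rendered manually with the same environment contract the CI smoke test uses (preset name and script path are taken from this change):

```bash
# Activate the llama.cpp pilot modules explicitly via the extra-compose seam.
export AOA_EXTRA_COMPOSE_FILES="compose/modules/32-llamacpp-inference.yml,compose/modules/44-llamacpp-agent-sidecar.yml"
scripts/aoa-render-config --preset intel-full >/dev/null
```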

## Rule

New capability should arrive as:
33 changes: 33 additions & 0 deletions compose/modules/32-llamacpp-inference.yml
@@ -0,0 +1,33 @@
services:
  llama-cpp:
    image: "${AOA_LLAMACPP_IMAGE:-ghcr.io/ggml-org/llama.cpp:server-openvino}"
    platform: linux/amd64
    container_name: llama-cpp
    restart: unless-stopped
    cpus: "${AOA_LLAMACPP_CPUS:-4.0}"
    mem_limit: "${AOA_LLAMACPP_MEM_LIMIT:-12g}"
    mem_reservation: "${AOA_LLAMACPP_MEM_RESERVATION:-8g}"
    environment:
      LLAMA_ARG_MODEL: /models/qwen3.5-9b.gguf
      LLAMA_ARG_ALIAS: "${AOA_LLAMACPP_MODEL_ALIAS:-qwen3.5:9b}"
      LLAMA_ARG_HOST: 0.0.0.0
      LLAMA_ARG_PORT: "8080"
      LLAMA_ARG_CTX_SIZE: "${AOA_LLAMACPP_CTX_SIZE:-4096}"
      LLAMA_ARG_THREADS: "${AOA_LLAMACPP_THREADS:-4}"
      LLAMA_ARG_THREADS_BATCH: "${AOA_LLAMACPP_THREADS_BATCH:-4}"
      LLAMA_ARG_THREADS_HTTP: "${AOA_LLAMACPP_THREADS_HTTP:-2}"
      LLAMA_ARG_PARALLEL: "${AOA_LLAMACPP_PARALLEL:-1}"
      LLAMA_ARG_BATCH_SIZE: "${AOA_LLAMACPP_BATCH_SIZE:-512}"
      LLAMA_ARG_UBATCH_SIZE: "${AOA_LLAMACPP_UBATCH_SIZE:-128}"
      LLAMA_ARG_N_GPU_LAYERS: "${AOA_LLAMACPP_N_GPU_LAYERS:-0}"
      LLAMA_ARG_DEVICE: "${AOA_LLAMACPP_DEVICE:-none}"
      LLAMA_ARG_ENDPOINT_METRICS: "${AOA_LLAMACPP_ENDPOINT_METRICS:-1}"
      LLAMA_ARG_JINJA: "${AOA_LLAMACPP_JINJA:-1}"
      LLAMA_ARG_REASONING: "${AOA_LLAMACPP_REASONING:-off}"
      LLAMA_ARG_THINK: "${AOA_LLAMACPP_THINK:-none}"
      LLAMA_ARG_NO_OP_OFFLOAD: "${AOA_LLAMACPP_NO_OP_OFFLOAD:-1}"
      LLAMA_ARG_NO_WARMUP: "${AOA_LLAMACPP_NO_WARMUP:-1}"
    volumes:
      - "${AOA_LLAMACPP_MODEL_HOST_PATH:-/srv/abyss-stack/Logs/llamacpp/missing-model.gguf}:/models/qwen3.5-9b.gguf:ro,Z"
    ports:
      - "127.0.0.1:${AOA_LLAMACPP_HOST_PORT:-11435}:8080"
32 changes: 32 additions & 0 deletions compose/modules/44-llamacpp-agent-sidecar.yml
@@ -0,0 +1,32 @@
services:
  langchain-api-llamacpp:
    build: "${AOA_STACK_ROOT:-/srv/abyss-stack}/Services/langchain-api"
    container_name: langchain-api-llamacpp
    env_file:
      - "${AOA_STACK_ROOT:-/srv/abyss-stack}/Secrets/Configs/langchain-api.env"
    environment:
      LC_BASE_URL: http://llama-cpp:8080/v1
      LC_API_KEY: EMPTY
      LC_MODEL: "${AOA_LLAMACPP_MODEL_ALIAS:-qwen3.5:9b}"
      LC_TIMEOUT_S: 300
      LC_OLLAMA_NATIVE_CHAT: "false"
      LC_OPENAI_LITERAL_COMPLETIONS: "true"
      AOA_RETURN_ENABLED: "${AOA_RETURN_ENABLED:-true}"
      AOA_RETURN_POLICY_PATH: "${AOA_RETURN_POLICY_PATH:-/app/config/return-policy.yaml}"
      AOA_RETURN_LOG_ROOT: "${AOA_RETURN_LOG_ROOT:-/app/logs/returns-llamacpp}"
      AOA_FEDERATED_RUN_ENABLED: "false"
      EMBEDDINGS_PROVIDER: ovms
      OVMS_EMBEDDINGS_URL: http://host.containers.internal:8200/v3/embeddings
      OVMS_EMBEDDINGS_MODEL: qwen3-embed-0.6b-int8-ov
    volumes:
      - "${AOA_STACK_ROOT:-/srv/abyss-stack}/Configs/agent-api/return-policy.yaml:/app/config/return-policy.yaml:ro,Z"
      - "${AOA_STACK_ROOT:-/srv/abyss-stack}/Logs/returns-llamacpp:/app/logs/returns-llamacpp:Z"
    ports:
      - "127.0.0.1:${AOA_LLAMACPP_LANGCHAIN_HOST_PORT:-5403}:5401"
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:5401/health', timeout=2).read()"]
      interval: 5s
      timeout: 3s
      retries: 12
      start_period: 5s
    restart: unless-stopped
120 changes: 119 additions & 1 deletion config-templates/Services/langchain-api/app/main.py
@@ -1,5 +1,6 @@
import json
import os
import re
import urllib.error
import urllib.request
from pathlib import Path
@@ -18,6 +19,9 @@

app = FastAPI()

THINK_TAG_PREFIX_RE = re.compile(r"^\s*<think>.*?</think>\s*", re.DOTALL)
LITERAL_REPLY_PROMPT_RE = re.compile(r"^Reply exactly with:\s*(.+?)\s*$", re.DOTALL)

BASE_URL = os.getenv("LC_BASE_URL", "http://ollama:11434/v1").rstrip("/")
API_KEY = os.getenv("LC_API_KEY", "EMPTY")
MODEL = os.getenv("LC_MODEL", "qwen3.5:9b")
@@ -29,6 +33,10 @@
    "yes",
    "on",
}
OPENAI_LITERAL_COMPLETIONS = os.getenv(
    "LC_OPENAI_LITERAL_COMPLETIONS",
    "false",
).strip().lower() in {"1", "true", "yes", "on"}
OLLAMA_NATIVE_CHAT_URL = os.getenv(
    "LC_OLLAMA_NATIVE_CHAT_URL",
    "http://ollama:11434/api/chat",
@@ -209,6 +217,18 @@ def _http_post_json(
    return parsed


def _http_auth_headers() -> dict[str, str] | None:
    if not API_KEY:
        return None
    return {"Authorization": f"Bearer {API_KEY}"}


def _llamacpp_completion_url() -> str:
    if BASE_URL.endswith("/v1"):
        return f"{BASE_URL[:-3]}/completion"
    return f"{BASE_URL}/completion"


def _route_api_post(path: str, payload: dict[str, Any]) -> dict[str, Any]:
    url = f"{ROUTE_API_BASE_URL}{path}"
    req = urllib.request.Request(
@@ -368,13 +388,106 @@ def _ollama_chat(req: RunReq) -> dict[str, Any]:
    return {"ok": True, "backend": "ollama-native", "model": MODEL, "answer": content}


def _flatten_response_content(content: Any) -> str:
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        chunks: list[str] = []
        for item in content:
            if isinstance(item, str):
                chunks.append(item)
                continue
            if isinstance(item, dict) and item.get("type") == "text" and isinstance(item.get("text"), str):
                chunks.append(item["text"])
        return "".join(chunks)
    return ""


def _normalize_answer_text(content: Any) -> str:
    text = _flatten_response_content(content).strip()
    while text:
        updated = THINK_TAG_PREFIX_RE.sub("", text, count=1).strip()
        if updated == text:
            break
        text = updated
    return text


def _literal_reply_target(req: RunReq) -> str | None:
    if not OPENAI_LITERAL_COMPLETIONS:
        return None
    if float(req.temperature) != 0.0:
        return None
    if int(req.max_tokens) > 16:
        return None
    match = LITERAL_REPLY_PROMPT_RE.fullmatch(req.user_text.strip())
    if not match:
        return None
    target = match.group(1).strip()
    if not target or len(target) > 160:
        return None
    return target


def _openai_completion(req: RunReq) -> dict[str, Any]:
    text = ""
    try:
        native_payload = {
            "model": MODEL,
            "prompt": req.user_text,
            "temperature": float(req.temperature),
            "n_predict": int(req.max_tokens),
        }
        native_data = _http_post_json(
            _llamacpp_completion_url(),
            native_payload,
            TIMEOUT,
            headers=_http_auth_headers(),
        )
        native_text = native_data.get("content")
        if isinstance(native_text, str):
            text = native_text
    except RuntimeError:
        text = ""

    if not text:
        payload = {
            "model": MODEL,
            "prompt": req.user_text,
            "temperature": float(req.temperature),
            "max_tokens": int(req.max_tokens),
        }
        data = _http_post_json(
            f"{BASE_URL}/completions",
            payload,
            TIMEOUT,
            headers=_http_auth_headers(),
        )
        choices = data.get("choices")
        if isinstance(choices, list) and choices:
            first = choices[0]
            if isinstance(first, dict):
                text = str(first.get("text") or "")
    if not isinstance(text, str) or not text:
        raise RuntimeError("unexpected_openai_completion_response: missing text")
    return {
        "ok": True,
        "backend": "langchain",
        "model": MODEL,
        "answer": _normalize_answer_text(text),
    }


def _invoke_run_backend(req: RunReq) -> dict[str, Any]:
    if OLLAMA_NATIVE_CHAT and ("litellm" in BASE_URL or "ollama" in BASE_URL):
        return _ollama_chat(req)

    if ChatOpenAI is None or HumanMessage is None:
        raise RuntimeError("langchain_openai dependencies are not installed")

    if _literal_reply_target(req) is not None:
        return _openai_completion(req)

    llm_kwargs: dict[str, Any] = {
        "model": MODEL,
        "base_url": BASE_URL,
@@ -402,7 +515,12 @@ def _invoke_run_backend(req: RunReq) -> dict[str, Any]:

    llm = ChatOpenAI(**llm_kwargs)
    resp = llm.invoke([HumanMessage(content=req.user_text)])
    return {"ok": True, "backend": "langchain", "model": MODEL, "answer": (resp.content or "")}
    return {
        "ok": True,
        "backend": "langchain",
        "model": MODEL,
        "answer": _normalize_answer_text(resp.content),
    }


def _effective_profile_class(profile_class: PROFILE_CLASS | None) -> PROFILE_CLASS:
11 changes: 11 additions & 0 deletions docs/FIRST_RUN.md
@@ -149,6 +149,17 @@ scripts/aoa-local-ai-trials run-wave W0
That flow keeps machine-readable trial truth under `Logs/local-ai-trials/` and writes Markdown mirrors to `Dionysus/reports/local-ai-trials/`.
Use [LOCAL_AI_TRIALS](LOCAL_AI_TRIALS.md) for the full contract.

## Optional llama.cpp backend-parity pilot

If you want to compare a bounded `llama.cpp` sidecar against the current validated Ollama path without replacing the canonical runtime:

```bash
scripts/aoa-llamacpp-pilot run --preset intel-full
```

That pilot resolves the resident Ollama GGUF blob, starts `llama-cpp` on a separate host port, exposes a sidecar `langchain-api-llamacpp` on `127.0.0.1:5403`, and writes comparison artifacts under `${AOA_STACK_ROOT}/Logs/runtime-benchmarks/comparisons/`.
Use [LLAMACPP_PILOT](LLAMACPP_PILOT.md) for the full contract.
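
Once the pilot is up, a quick smoke check against the sidecar's default loopback mapping can look like this (the `/health` path mirrors the compose healthcheck in this change; adjust the port if `AOA_LLAMACPP_LANGCHAIN_HOST_PORT` is overridden):

```bash
# Probe the langchain-api-llamacpp sidecar on its default host port.
curl -s http://127.0.0.1:5403/health
```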

## Compose optional layers manually

### Agent runtime plus tools