7 changes: 5 additions & 2 deletions README.md
@@ -91,8 +91,8 @@ For the shortest next route by intent:
- if you need playbook meaning, activation doctrine, or authored execution bundles, go to [`aoa-playbooks`](https://github.com/8Dionysus/aoa-playbooks)
- if you need the Windows host and WSL bridge workflow, read [docs/WINDOWS_BRIDGE](docs/WINDOWS_BRIDGE.md), [docs/WINDOWS_SETUP](docs/WINDOWS_SETUP.md), and [docs/WINDOWS_PERFORMANCE](docs/WINDOWS_PERFORMANCE.md)
- if you need runtime benchmark ownership, storage, and manifest rules, read [docs/RUNTIME_BENCH_POLICY](docs/RUNTIME_BENCH_POLICY.md)
- if you need the bounded llama.cpp A/B runtime pilot next to the validated Ollama path, read [docs/LLAMACPP_PILOT](docs/LLAMACPP_PILOT.md)
- if you need bounded local-model trial contracts, W4 supervised edits, or the promoted W5/W6 local-worker path, read [docs/LOCAL_AI_TRIALS](docs/LOCAL_AI_TRIALS.md)
- if you need the promoted local Qwen runtime path on `5403`, the retained Ollama control path on `5401`, or the bounded `llama.cpp` comparison and promotion lineage, read [docs/LLAMACPP_PILOT](docs/LLAMACPP_PILOT.md)
- if you need bounded local-model trial contracts, the adopted LangGraph execution posture, or the promoted W5/W6 local-worker path, read [docs/LOCAL_AI_TRIALS](docs/LOCAL_AI_TRIALS.md)
- if you need normative host posture or machine-readable host-facts capture, read [docs/REFERENCE_PLATFORM](docs/REFERENCE_PLATFORM.md) and [docs/REFERENCE_PLATFORM_SPEC](docs/REFERENCE_PLATFORM_SPEC.md)
- if you need to tune the runtime to the current machine, confirm driver freshness, or decide which preset the host should prefer, read [docs/MACHINE_FIT_POLICY](docs/MACHINE_FIT_POLICY.md)
- if you need a compact record of platform-specific quirks, adaptations, and portability notes, read [docs/PLATFORM_ADAPTATION_POLICY](docs/PLATFORM_ADAPTATION_POLICY.md)
@@ -185,6 +185,9 @@ The repository now includes:
## Current status

`abyss-stack` is now a live multi-service runtime with stateful storage, local and Intel-aware inference paths, monitoring, host-facts capture, machine-fit capture, platform-adaptation logging, and landed federation advisory seams for sibling AoA repositories.
The current bounded local-worker posture is `llama.cpp`-first on `5403`, with Ollama retained on `5401` as the control and rollback path.
`LangGraph` is now the adopted execution layer for bounded long-horizon and autonomy-focused local-worker flows, while the earlier W0-W4 runner lineage remains available as the historical baseline.
The current Intel embeddings posture still uses OVMS; any move from OpenVINO serving to OpenVINO GenAI should be treated as a separate reviewed stack change.
The first live consumer step has now landed in `langchain-api` through opt-in `POST /run/federated`, which can consume advisory playbook and memo seams without changing the default `POST /run` path.
The next large step is no longer bootstrap or mirror landing, nor whether the live runtime should consume those seams at all; it is deciding how broadly and how deeply the runtime loop should rely on the already-landed seams.

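As a quick illustration of that opt-in seam, a caller can keep the default path untouched and probe the federated route separately; the endpoint is the documented one, while the request body below is a placeholder rather than the real contract:

```bash
# Hedged sketch: exercise the opt-in federated route without changing the
# default POST /run path. The JSON body is illustrative only, not the
# actual request schema.
curl -s -X POST http://127.0.0.1:5401/run/federated \
  -H 'Content-Type: application/json' \
  -d '{"input": "smoke test"}'   # hypothetical payload shape
```
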
9 changes: 6 additions & 3 deletions docs/ARCHITECTURE.md
@@ -18,18 +18,21 @@ Persistent state and retrieval substrate:

Workflow coordination and pipeline surfaces:
- n8n
- LangGraph for bounded local-worker execution, pause/resume, and milestone-gated recovery flows

### 3. Inference layer

Local and accelerator-aware model serving:
- Ollama
- OVMS and Intel-oriented model serving
- llama.cpp as the promoted local GGUF-serving path for bounded local-worker flows
- Ollama as the retained control and rollback path
- OVMS as the current Intel/OpenVINO-oriented serving path for embeddings
- a future OpenVINO GenAI migration as a separate stack change, not part of the current promoted path

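As a minimal sketch of what the promoted serving path exposes, a direct probe can bypass the agent API layer entirely; the port follows the service catalog, the OpenAI-compatible route assumes a stock `llama.cpp` server, and the model name is illustrative:

```bash
# Hedged sketch: direct smoke test of the llama.cpp serving surface.
# Assumes the stock OpenAI-compatible route; the model name is a placeholder.
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "ping"}]}'
```
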
### 4. Gateway and agent API layer

Model routing and agent-facing runtime APIs:
- LiteLLM
- LangChain API or successor service modules
- LangChain API service modules, including the control-path `langchain-api` surface and the promoted local-worker `langchain-api-llamacpp` surface

This layer may also host the runtime return wrapper that rebuilds context from a last valid anchor rather than continuing under drift.

14 changes: 10 additions & 4 deletions docs/LANGGRAPH_PILOT.md
@@ -2,18 +2,18 @@

## Purpose

This document defines the bounded LangGraph sidecar pilot for `abyss-stack`.
This document defines the bounded LangGraph sidecar pilot for `abyss-stack` and records the execution-layer decision that came out of it.

It is not a new service and not a migration of `aoa-local-ai-trials`.
It is a comparison layer for one W4-shaped supervised edit flow.
It began as a comparison layer for one W4-shaped supervised edit flow and now serves as the origin surface for the adopted bounded execution layer used by `W5` and `W6`.

## Current pilot

Program id:
- `langgraph-sidecar-pilot-v1`
- `langgraph-sidecar-llamacpp-v1` for the disposable backend-promotion fixture gate

Current runtime path:
Current origin runtime path:
- `intel-full -> langchain-api /run -> ollama-native`

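A refresh of the fixture gate against the promoted worker path looks roughly like this; the flags mirror the invocation documented in LOCAL_AI_TRIALS, and the pairing of program id and URL is a sketch rather than a fixed recipe:

```bash
# Hedged sketch: materialize the pilot surface, then run the
# backend-promotion fixture gate against the promoted 5403 worker path.
scripts/aoa-langgraph-pilot materialize
scripts/aoa-langgraph-pilot \
  --url http://127.0.0.1:5403/run \
  --program-id langgraph-sidecar-llamacpp-v1
```
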
Current cases:
@@ -63,6 +63,11 @@ The sidecar pilot does not:
- replace `langchain-api /run`
- widen W4 into autonomous long-horizon execution

Current adopted role:
- `LangGraph` is the preferred bounded execution layer for `W5`, `W6`, and follow-on local-worker flows
- `aoa-local-ai-trials` remains the historical baseline for `W0` through `W4`
- `aoa-langgraph-pilot` remains the W4-shaped comparison and fixture surface

## Artifacts

Runtime truth:
@@ -93,4 +98,5 @@ The sidecar should answer a narrow question:
- does LangGraph improve pause/resume and recovery clarity for a bounded supervised edit flow
- without reducing W4 safety, scope discipline, or reportability

Until that answer is positive, the existing runner remains the execution baseline.
That answer is now positive for bounded local-worker flows.
Keep the sidecar pilot as the comparison and origin surface, and use the `W5` and `W6` contracts for the adopted execution posture.
40 changes: 25 additions & 15 deletions docs/LLAMACPP_PILOT.md
@@ -2,12 +2,14 @@

## Purpose

This document defines the bounded `llama.cpp` sidecar pilot for `abyss-stack`.
This document defines the bounded `llama.cpp` sidecar pilot for `abyss-stack` and records the promoted runtime posture that came out of it.

It exists to answer a narrow question:
The pilot originally existed to answer a narrow question:

**does a `llama.cpp` sidecar improve the local Qwen runtime posture on this machine without replacing the validated canonical Ollama path yet?**

That question is now answered positively for the current bounded local-worker path.

## Boundary

The pilot is:
@@ -19,15 +21,22 @@ The pilot is:
The pilot is not:
- a silent replacement for the canonical local runtime
- a proof-layer quality verdict
- a claim that `llama.cpp` is already promoted into machine-fit canon
- a claim that every service in the stack should immediately move off the retained control path

## Current promoted posture

The current preferred bounded local-worker path is:

## Current default posture
`intel-full -> langchain-api-llamacpp /run -> llama.cpp + route-api`

The validated canonical path remains:
The retained control and rollback path remains:

`intel-full -> langchain-api /run -> litellm/ollama + route-api`
`intel-full -> langchain-api /run -> ollama-native + route-api`

The `llama.cpp` pilot is intentionally separate from that path until a reviewed promotion decision says otherwise.
The pilot script remains intentionally useful after promotion:
- to refresh bounded backend comparisons
- to verify the promoted sidecar posture
- to keep the control path honest without making it the default worker substrate

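That post-promotion refresh reuses the operator entry point documented in the bench policy; a minimal invocation is:

```bash
# Hedged sketch: rerun the bounded A/B comparison after promotion. Per the
# bench policy, this writes a comparison packet under
# ${AOA_STACK_ROOT}/Logs/runtime-benchmarks/comparisons/.
scripts/aoa-llamacpp-pilot run --preset intel-full
```
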
## What the pilot reuses

@@ -44,19 +53,19 @@ This keeps the pilot honest:
- same quantized resident artifact
- different serving runtime

## Pilot services
## Promoted and control services

When the pilot is active, it adds two localhost-only services:
When the promoted path is active, it uses two localhost-only services:

- `llama-cpp` -> `http://127.0.0.1:11435`
- `langchain-api-llamacpp` -> `http://127.0.0.1:5403/health`

The canonical services stay in place:
The control-path services stay in place:

- `ollama` -> `http://127.0.0.1:11434`
- `langchain-api` -> `http://127.0.0.1:5401/health`

That separation preserves honest A/B comparison.
That separation preserves honest A/B comparison, rollback, and future challenger evaluation.

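A quick liveness pass over both pairs before an A/B run can look like this; the agent-API `/health` routes are the documented ones, while the raw backend probes assume stock `llama.cpp` and Ollama route names:

```bash
# Hedged sketch: confirm all four localhost endpoints answer.
curl -sf http://127.0.0.1:11435/health        # llama.cpp sidecar (stock route assumed)
curl -sf http://127.0.0.1:11434/api/version   # Ollama backend (stock route assumed)
curl -sf http://127.0.0.1:5403/health         # langchain-api-llamacpp
curl -sf http://127.0.0.1:5401/health         # langchain-api control path
```
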
## Operator commands

@@ -189,11 +198,12 @@ Promotion packets stay runtime-local too and capture:

A green or promising pilot does not automatically change the machine-fit record.

Promotion requires:
Promotion required:
- reviewed comparison output
- a clear recommendation that the sidecar is better for the intended bounded path
- an explicit update to machine-fit and the validated runtime docs

Until then:
- Ollama remains the validated preferred path
- `llama.cpp` remains an optional pilot substrate
Current result:
- `llama.cpp` is the preferred bounded local-worker path
- Ollama remains the validated control and rollback path
- any OpenVINO-side shift to OpenVINO GenAI should be reviewed separately from the `llama.cpp` promotion decision
14 changes: 9 additions & 5 deletions docs/LOCAL_AI_TRIALS.md
@@ -31,7 +31,7 @@ Control baseline:
Promoted bounded-worker path:
- runtime path: `http://127.0.0.1:5403/run`
- backend: `llama.cpp`
- orchestration: `LangGraph` for `W5` and `W6`
- orchestration: `LangGraph` for `W5`, `W6`, and the current bounded local-worker posture

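Before a trial wave, both surfaces can be checked in one line; the `/health` routes are the ones documented for the two agent APIs:

```bash
# Hedged sketch: the promoted worker and the control baseline should both
# answer before any bounded trial run.
curl -sf http://127.0.0.1:5403/health && curl -sf http://127.0.0.1:5401/health
```
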
Durable program roots now in use:
- `qwen-local-pilot-v1`
@@ -126,11 +126,9 @@ What it does not do:
- it does not upgrade runtime success into portable proof wording
- it does not collapse `W4` into a silent monolithic mutator

## LangGraph sidecar pilot
## LangGraph sidecar origin and promoted role

The current trial runner remains the execution baseline.

An optional comparison layer now also exists:
The original comparison layer still exists:

```bash
scripts/aoa-langgraph-pilot materialize
```
@@ -146,6 +144,12 @@ scripts/aoa-langgraph-pilot --url http://127.0.0.1:5403/run --program-id langgra

Use [LANGGRAPH_PILOT](LANGGRAPH_PILOT.md) for the sidecar contract.

That sidecar surface established the now-adopted execution posture:

- `aoa-local-ai-trials` remains the historical baseline for `W0` through `W4`
- `LangGraph` is now the primary orchestration layer for `W5`, `W6`, and the current bounded local-worker path
- `aoa-langgraph-pilot` remains the W4-shaped comparison and fixture surface rather than the full execution baseline

## W5 long-horizon pilot

The next bounded scenario layer lives beside the earlier waves:
11 changes: 8 additions & 3 deletions docs/MACHINE_FIT_POLICY.md
@@ -29,7 +29,7 @@ Use this layer for:
- preferred preset or profile selection for the current host
- current driver posture for visible accelerators
- package freshness for the host packages that matter to the runtime path
- validated local runtime settings such as bounded Ollama thread or batch posture
- validated local runtime settings such as bounded `llama.cpp` serving posture or control-path fallback settings
- warnings about noisy host envelopes that can distort latency-sensitive work
- compact refs to host facts, benchmark evidence, and adaptation records

@@ -140,5 +140,10 @@

It does not own the global meaning of sibling AoA layers, and it does not replace runtime benchmarks or proof artifacts.

An optional runtime sidecar pilot, such as a bounded `llama.cpp` comparison, does not change the preferred machine-fit posture by itself.
Only a reviewed promotion decision should move a pilot path into the validated preferred runtime path.
A bounded runtime comparison by itself does not change the preferred machine-fit posture.
Only a reviewed promotion decision should move a candidate path into the validated preferred runtime path.

The current reviewed posture is:
- `llama.cpp` as the preferred bounded local-worker path on `5403`
- Ollama as the retained control and rollback path on `5401`
- the Intel embeddings path still on OVMS, with any OpenVINO GenAI migration handled as a separate reviewed change
9 changes: 6 additions & 3 deletions docs/RUNTIME_BENCH_POLICY.md
@@ -125,6 +125,9 @@ scripts/aoa-qwen-bench --preset intel-full
This runner stays on the intended `langchain-api /run` path and writes machine-local evidence under `${AOA_STACK_ROOT}/Logs/runtime-benchmarks/runs/`.
It performs one uncounted warmup call per case before measured repeats so warm-latency reads stay warm by definition instead of by accident.

The default helper posture now targets the promoted local-worker path on `5403`.
Use an explicit `--url`, `--backend-label`, `--runtime-variant`, and `--target-label` when you want to refresh the retained Ollama control path on `5401`.

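A control-path refresh pinned explicitly can look like this; the flags are the ones named above, while the label values are illustrative placeholders:

```bash
# Hedged sketch: bench the retained Ollama control path instead of the
# 5403 default. The label values are placeholders, not canonical names.
scripts/aoa-qwen-bench --preset intel-full \
  --url http://127.0.0.1:5401/run \
  --backend-label ollama-native \
  --runtime-variant control \
  --target-label qwen-ollama-control
```
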
Refresh the durable catalog after new runs:

```bash
# catalog-refresh command collapsed in the diff view
```

@@ -175,14 +178,14 @@ That helper may reuse runtime benchmark artifacts as evidence inside case packet

## Optional backend-parity pilot

For a bounded `llama.cpp` versus Ollama comparison on the same host and the same `langchain-api /run` contract, use:
For a bounded refresh of the promoted `llama.cpp` path against the retained Ollama control path on the same host and the same `langchain-api /run` contract, use:

```bash
scripts/aoa-llamacpp-pilot run --preset intel-full
```

That pilot runs a fresh Ollama baseline on `5401`, a fresh `llama.cpp` sidecar bench on `5403`, and writes a comparison packet under `${AOA_STACK_ROOT}/Logs/runtime-benchmarks/comparisons/`.
It is a runtime-parity aid, not a promotion decision by itself.
That pilot runs a fresh Ollama control bench on `5401`, a fresh `llama.cpp` sidecar bench on `5403`, and writes a comparison packet under `${AOA_STACK_ROOT}/Logs/runtime-benchmarks/comparisons/`.
It remains a runtime-parity and challenger-evaluation aid even after the current `llama.cpp` promotion.

Use the catalog layer to answer:
- what the latest baseline run was for a target label
25 changes: 17 additions & 8 deletions docs/SERVICE_CATALOG.md
@@ -15,24 +15,26 @@ This file maps the first migrated runtime modules to their intended services.

## `30-local-inference.yml`

- `ollama` — local LLM and embedding serving
- `ollama` — retained local control and rollback serving surface for Qwen chat and fallback embeddings

## `31-intel-inference.yml`

- `ovms` — Intel and OpenVINO oriented model serving
- `ovms` — current Intel and OpenVINO oriented model serving surface for embeddings
- any migration from OVMS/OpenVINO serving to OpenVINO GenAI is a separate reviewed stack change

## `32-llamacpp-inference.yml`

- `llama-cpp` — optional OpenAI-compatible GGUF serving sidecar for bounded backend-parity work
- reuses a resolved local GGUF model file rather than changing the canonical validated Ollama path
- `llama-cpp` — promoted OpenAI-compatible GGUF serving surface for bounded local-worker flows
- reuses a resolved local GGUF model file and now backs the preferred local Qwen worker path on `5403`
- keeps Ollama in place as the control and rollback path

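For intuition, the serving surface behaves roughly like a stock `llama.cpp` server container; the image tag, flags, and model path below are assumptions for illustration, and the actual `32-llamacpp-inference.yml` stays canonical:

```bash
# Hypothetical stand-in for the llama-cpp service; image, flags, and model
# path are assumptions, and the compose module is the source of truth.
docker run --rm -p 127.0.0.1:11435:8080 \
  -v "${AOA_MODELS_DIR:?}/qwen.gguf:/models/qwen.gguf:ro" \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/qwen.gguf --host 0.0.0.0 --port 8080
```
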
## `40-llm-gateway.yml`

- `litellm` — model gateway and routing facade

## `41-agent-api.yml`

- `langchain-api` — base agent-facing runtime API
- `langchain-api` — base control-path agent-facing runtime API on `5401`
- default embeddings path — Ollama-first
- may consume a public-safe return policy file and emit runtime return events
- now also exposes opt-in `POST /run/federated` for live advisory consumption of `route-api` playbook and memo seams
@@ -45,9 +47,16 @@

## `44-llamacpp-agent-sidecar.yml`

- `langchain-api-llamacpp` — optional sidecar agent API bound to a `llama.cpp` backend on a separate host port
- preserves the canonical `langchain-api` service and `5401` path for honest A/B comparison
- keeps embeddings on OVMS for Intel-aware pilot runs
- `langchain-api-llamacpp` — promoted bounded local-worker API bound to a `llama.cpp` backend on `5403`
- is the preferred local Qwen worker path for the current promoted `W5/W6` substrate
- preserves the base `langchain-api` service and `5401` path as the control and rollback surface
- keeps embeddings on OVMS for the current Intel-aware posture

## Execution layer

- `LangGraph` is now the adopted bounded execution layer for the `W5` and `W6` local-worker flows
- it remains a CLI-side execution surface rather than a long-running network service
- the original `aoa-langgraph-pilot` remains useful as the W4-shaped comparison and fixture surface

## `43-federation-router.yml`

21 changes: 14 additions & 7 deletions docs/machine-fit/machine-fit.public.json.example
@@ -126,12 +126,19 @@
"tools",
"observability"
],
"preferred_runtime_path": "intel-full -> langchain-api /run -> litellm/ollama + route-api",
"validated_acceleration_posture": "OVMS embeddings on Intel GPU; Qwen chat via Ollama; Intel NPU is visible but not yet part of the validated canonical path.",
"preferred_runtime_path": "intel-full -> langchain-api-llamacpp /run -> llama.cpp + route-api",
"validated_acceleration_posture": "OVMS embeddings on Intel GPU; Qwen chat on the promoted llama.cpp path; Ollama remains the validated control path; Intel NPU is visible but not yet part of the validated canonical path.",
"validated_settings": {
"LC_OLLAMA_NUM_THREAD": "6",
"LC_OLLAMA_NUM_BATCH": "32",
"LC_OLLAMA_THINK": "false"
"AOA_LLAMACPP_DEVICE": "none",
"AOA_LLAMACPP_NO_OP_OFFLOAD": "1",
"AOA_LLAMACPP_THREADS": "4",
"AOA_LLAMACPP_THREADS_BATCH": "4",
"AOA_LLAMACPP_THREADS_HTTP": "2",
"AOA_LLAMACPP_CTX_SIZE": "4096",
"AOA_LLAMACPP_BATCH_SIZE": "512",
"AOA_LLAMACPP_UBATCH_SIZE": "128",
"AOA_LLAMACPP_REASONING": "off",
"AOA_LLAMACPP_THINK": "none"
},
"recommended_overlays": [],
"current_overlays": [],
@@ -148,7 +155,7 @@
},
"fit_verdict": {
"status": "qualified",
"summary": "Preferred preset is intel-full. Qwen chat should stay on langchain-api /run through the validated local path. Relevant host packages are current in the configured Fedora repositories.",
"summary": "Preferred preset is intel-full. Qwen chat should stay on the promoted llama.cpp path, with Ollama retained as the control path. Relevant host packages are current in the configured Fedora repositories.",
"next_actions": [
"Run scripts/aoa-doctor --preset intel-full before launch.",
"Refresh host facts when the host or kernel changes.",
@@ -158,7 +165,7 @@
"kernel update",
"linux-firmware update",
"mesa or Intel runtime update",
"Ollama or langchain-api runtime change",
"llama.cpp, Ollama control-path, or langchain-api runtime change",
"host load envelope change before latency-sensitive trials"
]
},