Commit `6323df1` (parent `96a7193`)

config: switch primary LLM to local Ollama (qwen3.5 on M3 Max)

- LLM_PROVIDER=ollama; per-tier: IC/decoder → qwen3.5:9b, pod/assistant → qwen3.5:35b-a3b
- Workshop sessions keep Bedrock Sonnet via LLM_MODEL_WORKSHOP (cloud reasoning quality)
- llm_config.py: tier-aware `_model_ollama`; `_build_ollama_llm` returns `crewai.LLM` with `api_base` + `extra_body={"think": False}` to disable Qwen extended thinking for agents
- New `get_llm_for_role()` for explicit role-based resolution outside CrewAI context
- New `workshop` tier in `_TIER_ENV_OVERRIDE` / `_classify_tier`
- scripts/check_ollama_health.py: stdlib health check (server, pulled, in-memory); auto-preloads models; exit 0 = healthy
- /trade and /workshop Step 0 now gate on the health check before any crew calls
- CLAUDE.md LLM section updated with model tables, cost breakdown, launchd config

Peak memory budget: ~36 GB of 128 GB (6 GB 9b + 20 GB 35b-a3b + 10 GB KV cache). Cost: $0.00/crew run vs ~$0.03 on Bedrock.

File tree: 7 files changed (+397, −42 lines)

`.claude/memory/session_handoffs.md` (24 additions, 0 deletions)

```diff
@@ -8,6 +8,30 @@
 
 | Date | From Session | To Session | Key Context |
 |------|-------------|------------|-------------|
+| 2026-03-15 | config | all | Local Ollama set as primary LLM provider. See self-modification log below. |
+
+---
+
+## Self-Modification Log
+
+### 2026-03-15 — Local Ollama setup (M3 Max 128GB)
+
+**Files modified:**
+- `packages/quant_pod/llm_config.py` — Added `workshop` tier, tier-aware `_model_ollama`, `_build_ollama_llm` helper (injects `api_base` + `extra_body={"think": False}`), and a public `get_llm_for_role()`. `get_llm_for_agent` now returns a `crewai.LLM` object for Ollama models instead of a plain string.
+- `.env` — Switched `LLM_PROVIDER=ollama`. Set per-tier overrides: IC/decoder → qwen3.5:9b, pod/assistant → qwen3.5:35b-a3b, workshop → bedrock/claude-sonnet-4. Fallback chain: bedrock,openai.
+- `.env.example` — Added Ollama section with comments.
+- `scripts/check_ollama_health.py` — New script (stdlib only). Checks server reachability, pulled models, and loaded models; auto-preloads if needed. Exit 0 = healthy.
+- `.claude/skills/trade.md` — Step 0 now runs the health check first; abort if models are not loaded.
+- `.claude/skills/workshop.md` — Step 0 now runs the health check + AWS credential verification.
+- `CLAUDE.md` — LLM Configuration section replaced with local model tables, cost breakdown, health check instructions, and launchd startup config.
+
+**Why:** The M3 Max has 128 GB of unified memory. Two models (~36 GB peak total) run permanently at zero cost vs ~$0.03/crew run on Bedrock. NUM_PARALLEL=10 lets all 10 ICs hit qwen3.5:9b simultaneously. Thinking mode is disabled for agents via `extra_body={"think": False}` — saves tokens/latency with no quality loss for focused IC work. Workshop sessions keep Bedrock Sonnet for deep reasoning quality.
+
+**Still needed (manual steps — Ollama not running at time of config):**
+1. `ollama serve` or start the Ollama app
+2. `ollama pull qwen3.5:9b` and `ollama pull qwen3.5:35b-a3b`
+3. Set launchd env vars: KEEP_ALIVE=-1, FLASH_ATTENTION=1, NUM_PARALLEL=10, then restart Ollama
+4. Run `python scripts/check_ollama_health.py` to verify
 
 
 ## Self-Modification Log
```

`.claude/skills/trade.md` (5 additions, 2 deletions)

```diff
@@ -12,8 +12,11 @@ and (when execution is enabled) place trades.
 
 ## Workflow
 
-### Step 0: Read Context
-Before any tool calls, load your persistent memory:
+### Step 0: Read Context + Model Health Check
+Before any tool calls:
+- Run `python scripts/check_ollama_health.py` — abort if models not loaded.
+  Both qwen3.5:9b and qwen3.5:35b-a3b must be resident in memory before crew runs.
+  If the health check fails: do not proceed. Report the error and stop.
 - Read `.claude/memory/trade_journal.md` — last 5 trades for patterns
 - Read `.claude/memory/regime_history.md` — is regime transitioning?
 - Read `.claude/memory/session_handoffs.md` — relevant handoffs from other sessions?
```
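The Step 0 gate above relies on the script's exit status (exit 0 = healthy). A sketch of that gate pattern, with `true` standing in for the real command so the snippet runs anywhere:

```shell
# Stand-in for: python scripts/check_ollama_health.py (exit 0 = healthy)
check_cmd=true
if ! $check_cmd; then
    echo "Health check failed: models not resident. Aborting /trade." >&2
    exit 1
fi
echo "models resident, proceeding"
```

In the skill itself the agent runs the real script and stops on any nonzero exit, per the diff above.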

`.claude/skills/workshop.md` (8 additions, 2 deletions)

```diff
@@ -14,8 +14,14 @@ through the same lifecycle: hypothesis → backtest → walk-forward → registe
 
 ## Workflow
 
-### Step 0: Read Context
-Before any tool calls, load your persistent memory:
+### Step 0: Read Context + Infrastructure Check
+Before any tool calls:
+- Run `python scripts/check_ollama_health.py` — abort if models not loaded.
+  The trading crew used in backtesting depends on local models being resident.
+- Verify AWS credentials for Bedrock (used for deep workshop reasoning):
+  `aws sts get-caller-identity` — confirm it returns an ARN without error.
+  If credentials are expired, deep hypothesis reasoning will fall back to
+  Ollama (weaker). Note this in the session if degraded mode is active.
 - Read `.claude/memory/workshop_lessons.md` — don't repeat failed hypotheses
 - Read `.claude/memory/strategy_registry.md` — what strategies exist, what gaps remain
 - Read `.claude/memory/regime_history.md` — what's the current regime?
```
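The credential check above can be automated with a small wrapper around `aws sts get-caller-identity`. This is a hypothetical helper (the function name `bedrock_available` and the degraded-mode string are mine, not the repo's); it treats a missing `aws` binary, a timeout, or a nonzero exit as degraded mode.

```python
import shutil
import subprocess

def bedrock_available() -> bool:
    """True if AWS credentials resolve to an ARN; False means degraded
    (Ollama fallback) mode per the workshop Step 0 notes."""
    if shutil.which("aws") is None:
        return False  # CLI not installed: cannot verify, assume degraded
    try:
        out = subprocess.run(
            ["aws", "sts", "get-caller-identity"],
            capture_output=True, text=True, timeout=15,
        )
    except subprocess.TimeoutExpired:
        return False
    # A successful call prints JSON containing an "Arn" field
    return out.returncode == 0 and "Arn" in out.stdout

mode = "bedrock" if bedrock_available() else "degraded (ollama fallback)"
print(f"workshop reasoning mode: {mode}")
```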

`.env.example` (11 additions, 1 deletion)

```diff
@@ -95,8 +95,15 @@ OPENAI_MODEL=gpt-4o
 # MISTRAL_MODEL=mistral-large-latest
 
 # --- Ollama (Tier 3 — local, no API key required) ---
+# To use Ollama as primary provider, set LLM_PROVIDER=ollama and configure:
 # OLLAMA_BASE_URL=http://localhost:11434
-# OLLAMA_MODEL=llama3.3:70b
+# OLLAMA_MODEL=qwen3.5:35b-a3b               # pods + assistant (MoE, ~20GB)
+# OLLAMA_IC_MODEL=qwen3.5:9b                 # ICs + decoder (dense, ~6GB)
+# LLM_MODEL_IC=ollama/qwen3.5:9b
+# LLM_MODEL_POD=ollama/qwen3.5:35b-a3b
+# LLM_MODEL_ASSISTANT=ollama/qwen3.5:35b-a3b
+# LLM_MODEL_DECODER=ollama/qwen3.5:9b
+# LLM_MODEL_WORKSHOP=bedrock/us.anthropic.claude-sonnet-4-20250514   # cloud for /workshop
 
 # --- Custom OpenAI-compatible (Tier 3 — vLLM, LM Studio, etc.) ---
 # CUSTOM_OPENAI_BASE_URL=http://localhost:8000/v1
@@ -125,6 +132,9 @@ OPENAI_MODEL=gpt-4o
 # Decoder ICs — pattern recognition, same as IC tier by default
 # LLM_MODEL_DECODER=gemini/gemini-2.5-flash
 
+# Workshop — deep strategy research (cloud model recommended, smarter than local)
+# LLM_MODEL_WORKSHOP=bedrock/us.anthropic.claude-sonnet-4-20250514
+
 # =============================================================================
 # OPTIONAL: Risk limits (override defaults)
 # =============================================================================
```
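All model values in this file follow the `provider/model_id` convention, and `LLM_FALLBACK_CHAIN` is a comma-separated provider list. A quick illustration of how such strings can be parsed; the helper names here are hypothetical, not the repo's API.

```python
def split_model_string(s: str) -> tuple[str, str]:
    """'ollama/qwen3.5:9b' -> ('ollama', 'qwen3.5:9b')."""
    provider, _, model = s.partition("/")
    return provider, model

def fallback_order(primary: str, chain: str) -> list[str]:
    """Primary provider first, then LLM_FALLBACK_CHAIN entries, deduplicated."""
    seen: list[str] = []
    for p in [primary, *chain.split(",")]:
        p = p.strip()
        if p and p not in seen:
            seen.append(p)
    return seen

print(fallback_order("ollama", "bedrock,openai"))  # ['ollama', 'bedrock', 'openai']
```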

`CLAUDE.md` (60 additions, 24 deletions)

````diff
@@ -202,40 +202,76 @@ all model strings follow the `provider/model_id` format.
 4. ProviderConfigError if all fail
 ```
 
-### Agent tiers and default Bedrock models
+### Always Loaded (Ollama — M3 Max 128GB)
 
-| Tier | Agents | Bedrock default |
-|------|--------|-----------------|
-| `ic` | `*_ic` | `us.anthropic.claude-haiku-4-5-20251001-v1:0` |
-| `pod` | `*_pod_manager` | `us.anthropic.claude-sonnet-4-6` |
-| `assistant` | `trading_assistant`, `super_trader` | `us.anthropic.claude-sonnet-4-6` |
-| `decoder` | decoder crew agents | `us.anthropic.claude-haiku-4-5-20251001-v1:0` |
+| Model | Role | RAM | Active Params | Speed |
+|-------|------|-----|---------------|-------|
+| `qwen3.5:9b` | ICs (10), Decoder ICs (4) | ~6 GB | 9B dense | ~50 tok/s |
+| `qwen3.5:35b-a3b` | Pods (5), Assistant (1) | ~20 GB | 3B MoE | ~35 tok/s |
 
-### Cost presets (est. per full crew run)
+Peak crew memory: ~36 GB of 128 GB — no swap risk.
+`NUM_PARALLEL=10` handles all ICs hitting qwen3.5:9b simultaneously.
+Thinking mode disabled for all CrewAI agents (`think: false` via `extra_body`).
 
-| Setup | Config | Est. cost |
-|-------|--------|-----------|
-| Budget | `LLM_MODEL_IC=groq/llama-3.3-70b-versatile` + Gemini pods | ~$0.005 |
-| Balanced | Gemini ICs + Bedrock Sonnet pods | ~$0.02 |
-| Default | Haiku ICs + Sonnet pods (Bedrock) | ~$0.03 |
-| Premium | Haiku ICs + Sonnet pods + Opus assistant | ~$0.08 |
+### On Demand (Cloud)
+
+| Provider | Model | Role |
+|----------|-------|------|
+| AWS Bedrock | Claude Sonnet 4 | `/workshop` deep reasoning, fallback |
+| OpenAI | GPT-4o | secondary fallback |
+
+### Cost per Full Crew Run
+
+| Config | Cost | Notes |
+|--------|------|-------|
+| Local Ollama (current) | $0.00 | All ICs + pods local |
+| Bedrock fallback (Haiku ICs + Sonnet pods) | ~$0.02 | If Ollama down |
+| OpenAI GPT-4o (all) | ~$0.12 | Last resort fallback |
+| Workshop session (Bedrock Sonnet) | ~$0.02 | Deep hypothesis research |
+
+### Health Check
+
+Run `scripts/check_ollama_health.py` before any session.
+Both models must be loaded in memory. The script auto-preloads if pulled but not resident.
+
+### Agent tiers and model assignment
+
+| Tier | Agents | Local model | Env override |
+|------|--------|-------------|-------------|
+| `ic` | `*_ic` | `ollama/qwen3.5:9b` | `LLM_MODEL_IC` |
+| `pod` | `*_pod_manager` | `ollama/qwen3.5:35b-a3b` | `LLM_MODEL_POD` |
+| `assistant` | `trading_assistant`, `super_trader` | `ollama/qwen3.5:35b-a3b` | `LLM_MODEL_ASSISTANT` |
+| `decoder` | decoder crew agents | `ollama/qwen3.5:9b` | `LLM_MODEL_DECODER` |
+| `workshop` | (not a CrewAI agent — use `get_llm_for_role("workshop")`) | bedrock | `LLM_MODEL_WORKSHOP` |
 
 ### Key env vars
 
 ```bash
-LLM_PROVIDER=bedrock
-LLM_FALLBACK_CHAIN=anthropic,openai
-
-# Bedrock
+LLM_PROVIDER=ollama
+OLLAMA_BASE_URL=http://localhost:11434
+OLLAMA_MODEL=qwen3.5:35b-a3b
+LLM_FALLBACK_CHAIN=bedrock,openai
+
+# Per-tier overrides (active in .env):
+LLM_MODEL_IC=ollama/qwen3.5:9b
+LLM_MODEL_POD=ollama/qwen3.5:35b-a3b
+LLM_MODEL_ASSISTANT=ollama/qwen3.5:35b-a3b
+LLM_MODEL_DECODER=ollama/qwen3.5:9b
+LLM_MODEL_WORKSHOP=bedrock/us.anthropic.claude-sonnet-4-20250514
+
+# Cloud fallback
 BEDROCK_REGION=us-east-1
-BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-6
+BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-20250514
 # AWS_PROFILE=DataScience.Admin-Analytics
+```
 
-# Per-tier overrides (any LiteLLM model string, take precedence over everything):
-# LLM_MODEL_IC=bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0
-# LLM_MODEL_POD=bedrock/us.anthropic.claude-sonnet-4-6
-# LLM_MODEL_ASSISTANT=bedrock/us.anthropic.claude-opus-4-6-v1
-# LLM_MODEL_DECODER=gemini/gemini-2.5-flash
+### Ollama startup (launchd — required for concurrency)
+
+```bash
+launchctl setenv OLLAMA_KEEP_ALIVE -1       # models stay loaded permanently
+launchctl setenv OLLAMA_FLASH_ATTENTION 1   # Metal acceleration on Apple Silicon
+launchctl setenv OLLAMA_NUM_PARALLEL 10     # 10 parallel IC requests at once
+# Then restart Ollama
+```
 
 ---
````
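The health-check behavior described in the CLAUDE.md diff (server reachable, models pulled, models resident, auto-preload) can be sketched with nothing but the stdlib and Ollama's documented REST endpoints (`/api/tags` for pulled models, `/api/ps` for loaded models, a model-only `/api/generate` call with `keep_alive` to preload). This is a sketch of what `scripts/check_ollama_health.py` is said to do, not the script itself; the real implementation may differ.

```python
import json
import urllib.error
import urllib.request

BASE = "http://localhost:11434"
REQUIRED = ["qwen3.5:9b", "qwen3.5:35b-a3b"]

def _get(path: str) -> dict:
    with urllib.request.urlopen(BASE + path, timeout=5) as r:
        return json.load(r)

def main() -> int:
    """Return 0 if healthy (per the convention above), 1 otherwise."""
    try:
        pulled = {m["name"] for m in _get("/api/tags").get("models", [])}
    except (urllib.error.URLError, OSError):
        print("Ollama server unreachable")
        return 1
    missing = [m for m in REQUIRED if m not in pulled]
    if missing:
        print(f"models not pulled: {missing}")
        return 1
    loaded = {m["name"] for m in _get("/api/ps").get("models", [])}
    for model in REQUIRED:
        if model not in loaded:
            # A prompt-less generate call with keep_alive=-1 loads the
            # model and keeps it resident permanently
            req = urllib.request.Request(
                BASE + "/api/generate",
                data=json.dumps({"model": model, "keep_alive": -1}).encode(),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req, timeout=120).read()
    print("healthy")
    return 0

# In the real script this would end with: sys.exit(main())
```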
