
Fine-tune void on an open model for self-hosted inference #67

@cpfiffer

Description


Goal

Replace void's Gemini 2.5 Pro with a fine-tuned open model (Llama 3.1 8B or similar) running on self-hosted infrastructure. Void has 49,786 posts and extensive cognition records. The fine-tune should capture void's voice, analytical style, and engagement patterns at a fraction of the inference cost.

Data Available

Posts: 49,786 on comind.network PDS (99% replies with fetchable parent context)

  • app.bsky.feed.post - public Bluesky posts
  • stream.thought.reasoning - internal reasoning traces
  • stream.thought.memory - episodic memory records
  • stream.thought.tool.call - tool usage patterns

Context window (already exported to data/void-context/):

  • System prompt: 5,618 chars
  • 25 memory blocks: 64,561 chars total
  • Key blocks: void-persona (5.1k), operational_protocols (18.8k), communication_guidelines (6k), zeitgeist (450)
  • Prompt template from ~/code/void/bsky.py handler

Agent: void-prime (agent-01086cda-be1f-4986-bf3e-ca5b6297cc5d) on Letta Cloud

Pipeline (built, partially tested)

1. Export raw data

uv run python tools/export_training_data.py void.comind.network \
    -o data/void-raw.jsonl \
    --collections app.bsky.feed.post stream.thought.reasoning stream.thought.memory \
    --filter-chars
  • Paginates PDS via com.atproto.repo.listRecords
  • Fetches parent/root post context for every reply
  • Filters character creation loop content (known failure mode: D&D-style sheets)
  • Outputs JSONL with id, text, parent_text, parent_author, root_text, etc.
  • Estimated time: hours (50k records + parent fetches with rate limiting)
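The cursor-following loop behind the first bullet can be sketched as below. `list_records` and the injected `fetch` callable are illustrative names, not the actual interface of export_training_data.py:

```python
# Cursor-based pagination over com.atproto.repo.listRecords, sketched.
# `fetch(params)` stands in for an HTTP GET against the PDS XRPC endpoint
# and must return the parsed JSON body: {"records": [...], "cursor": ...}.
def list_records(fetch, repo, collection, limit=100):
    """Yield every record in `collection`, following the cursor to the end."""
    cursor = None
    while True:
        params = {"repo": repo, "collection": collection, "limit": limit}
        if cursor:
            params["cursor"] = cursor
        page = fetch(params)
        yield from page.get("records", [])
        cursor = page.get("cursor")
        if not cursor:
            break  # no cursor in the response means the listing is exhausted
```

The real script layers rate limiting, the character-sheet filter, and parent-post fetches on top of this loop, which is where the hours-long runtime comes from.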

2. Format for fine-tuning

uv run python tools/format_training_data.py data/void-raw.jsonl \
    -o data/void-finetune.jsonl \
    --system-prompt data/void-context/full-context.txt \
    --replies-only \
    --format sharegpt
  • Injects void's actual context window as system prompt
  • Reconstructs thread context as user messages
  • Formats as chat completions (OpenAI, ShareGPT, or Alpaca)
  • Filters short responses (<20 chars)
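In the ShareGPT output format, each exported row plausibly becomes one three-turn conversation. The helper below is a hypothetical sketch, not format_training_data.py's actual code; the field names (text, parent_text, parent_author) come from the export step above:

```python
# Hypothetical converter from an exported row to a ShareGPT-style record.
def to_sharegpt(row, system_prompt, min_chars=20):
    """Return a ShareGPT conversation dict, or None if the reply is too short."""
    if len(row["text"]) < min_chars:
        return None  # drop short responses, per the pipeline's filter
    user_turn = f'@{row["parent_author"]}: {row["parent_text"]}'
    return {
        "conversations": [
            {"from": "system", "value": system_prompt},  # void's context window
            {"from": "human", "value": user_turn},       # reconstructed thread
            {"from": "gpt", "value": row["text"]},       # void's actual reply
        ]
    }
```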

3. Fine-tune

Base model candidates:

  Model                  Size  VRAM needed (QLoRA)  Notes
  Llama 3.1 8B Instruct  8B    ~12GB                Best quality/cost ratio
  Mistral 7B v0.3        7B    ~10GB                Good at conversation
  Llama 3.2 3B           3B    ~6GB                 Cheapest, may lose nuance
  Qwen 2.5 7B            7B    ~10GB                Strong multilingual

Training approach: QLoRA (4-bit quantization + LoRA adapters)

  • Hardware: single A100 (80GB) or 4090 (24GB)
  • Estimated training time: 2-4 hours on 40k+ pairs
  • Framework: axolotl, unsloth, or huggingface TRL
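As a concrete starting point, a QLoRA run in axolotl (one of the frameworks listed) might look like the config below. All values are illustrative defaults, not tuned settings, and the paths/names are placeholders to be checked against axolotl's docs:

```yaml
# Illustrative axolotl QLoRA config for the Llama 3.1 8B candidate.
base_model: meta-llama/Llama-3.1-8B-Instruct
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: data/void-finetune.jsonl
    type: sharegpt
sequence_len: 8192          # must cover the (trimmed) system prompt + thread
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 2
learning_rate: 0.0002
lr_scheduler: cosine
bf16: true
output_dir: ./outputs/void-qlora
```

The unsloth or TRL equivalents differ mainly in packaging; the LoRA rank, 4-bit load, and sequence length are the knobs that drive the ~12GB VRAM estimate above.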

4. Evaluation

This is the hardest part. Proposed approach:

  • Held-out test set: 500 reply pairs void actually wrote, not seen during training
  • A/B comparison: show each test input to both the fine-tuned model and base Llama, then compare both outputs against void's actual response
  • Voice metrics: response length distribution, vocabulary overlap, analytical depth (manual review of 50 samples)
  • Failure mode check: does it generate character sheets? Does it break voice on edge cases?
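Two of the proposed voice metrics are straightforward to make concrete. The functions below are an illustrative sketch (names are mine, not the repo's): length distribution compared between corpora, and vocabulary overlap as Jaccard similarity over word sets.

```python
# Sketch of two voice metrics: response-length stats and vocabulary overlap
# between the held-out replies void wrote and the model's generations.
from statistics import mean, median

def length_stats(texts):
    """Character-length distribution summary for one corpus."""
    lengths = [len(t) for t in texts]
    return {"mean": mean(lengths), "median": median(lengths)}

def vocab_overlap(ref_texts, gen_texts):
    """Jaccard similarity between the two corpora's lowercased word sets."""
    ref = {w for t in ref_texts for w in t.lower().split()}
    gen = {w for t in gen_texts for w in t.lower().split()}
    return len(ref & gen) / len(ref | gen) if ref | gen else 0.0
```

Comparing these numbers for void-vs-fine-tune against void-vs-base-Llama gives a cheap first signal before the manual review of 50 samples.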

5. Serve

Options:

  • vLLM on a dedicated GPU instance (most performant)
  • llama.cpp on CPU (cheapest, slower)
  • Ollama for easy deployment
  • Together.ai / Fireworks for managed inference (middle ground)

Then point void's handler at the new endpoint instead of Gemini.
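Since vLLM (and Ollama) expose an OpenAI-compatible chat-completions API, re-pointing the handler mostly means changing the base URL and model name. A minimal sketch of the request payload, with placeholder endpoint and model names and illustrative sampling settings:

```python
# Hypothetical request builder for the self-hosted endpoint; POST this as
# JSON to e.g. http://gpu-host:8000/v1/chat/completions on a vLLM server.
def build_chat_request(system_prompt, thread_context,
                       model="void-llama-3.1-8b"):
    """Assemble an OpenAI-style chat-completion payload for the fine-tune."""
    return {
        "model": model,  # placeholder name for the served fine-tuned model
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": thread_context},
        ],
        "max_tokens": 300,   # Bluesky replies are short
        "temperature": 0.8,  # illustrative, not a tuned value
    }
```

Because the payload shape matches what the Gemini-backed handler already produces via an OpenAI-style client, the swap should be confined to configuration rather than handler logic.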

Known Issues

  • Character creation loop: 46% of recent stream.thought.memory records are D&D character sheets. Must filter aggressively. Keywords list in export_training_data.py.
  • Context window size: void's full context is 64k chars (roughly 16k tokens at ~4 chars/token). Most 8B models are served with 8k-32k of effective context, so we may need to trim to the essential blocks (persona, operational_protocols, communication_guidelines) or use a long-context model.
  • Memory block drift: void's blocks change over time. The exported context is a snapshot from 2026-02-13. Training data from 6 months ago had different blocks. Could cause distribution mismatch.
  • Tool calls: void uses add_post_to_bluesky_reply_thread tool. Fine-tuned model needs to learn this tool-calling pattern, or we restructure the handler to extract text from the model and call the tool externally.
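The second option for the tool-call issue (extract text from the model, call the tool externally) keeps the fine-tune a plain text-in/text-out model. A minimal sketch, assuming the handler owns the posting logic and Bluesky's 300-character post limit; `post_reply` and `send_post` are hypothetical names:

```python
import textwrap

def post_reply(model_output, send_post, limit=300):
    """Split the model's plain-text output into <=300-char chunks and send
    each via `send_post`, a thin wrapper around the handler's existing
    add_post_to_bluesky_reply_thread call."""
    for chunk in textwrap.wrap(model_output, width=limit):
        send_post(chunk)
```

This sidesteps teaching the model a tool-calling format entirely, at the cost of losing any other tools void currently invokes mid-reply.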

Files

  • tools/export_training_data.py - PDS export with parent fetching
  • tools/format_training_data.py - Chat completion formatter
  • data/void-context/ - Exported context window (system prompt + 25 blocks)
  • data/void-sample.jsonl - 20-record test sample
  • data/void-finetune-sample.jsonl - Formatted sample

Next Steps

  1. Run full 50k export (long-running, background)
  2. Decide on context window trimming strategy
  3. Choose base model + training framework
  4. Set up training environment (GPU instance)
  5. Train + evaluate
  6. Deploy and wire into void's handler
