Direct RLHF Feedback Loops for Local Model Fine-Tuning #57

## Summary

Generate preference datasets from user interactions and use them to fine-tune local models through open-source RLHF pipelines.

## Problem

When users correct agent mistakes via the CLI, the correction is not recorded, so a valuable training signal is lost. There is no mechanism to capture chosen vs. rejected actions in a format that lets a local model improve over time.

## Proposal

Log success/failure trajectories in a format natively digestible by open-source RLHF pipelines (like OpenRLHF, TRL, or Axolotl):

- **Preference dataset generation**: When a user corrects an agent action (via `/undo`, manual file edit, or explicit "that's wrong"), capture the (prompt, chosen_response, rejected_response) triple
- **DPO/RLHF-ready format**: Output in standard formats (JSONL, Parquet) that load directly with Hugging Face `datasets`
- **Coding style adaptation**: Over time, the preference data captures the user's specific coding style, naming conventions, and architectural preferences
- **Local fine-tuning pipeline**: Provide a `selfware fine-tune` command that runs LoRA/QLoRA fine-tuning using collected preference data
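
The capture step in the list above could be modelled as a small record type. This is a minimal Python sketch, not the project's implementation: `PreferenceRecord`, `to_jsonl_line`, and the `source` metadata key are hypothetical names chosen for illustration. Only the `prompt`, `chosen`, `rejected`, and `metadata` keys come from this proposal.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PreferenceRecord:
    """One (prompt, chosen, rejected) triple. Hypothetical type for illustration."""
    prompt: str
    chosen: str
    rejected: str
    metadata: dict = field(default_factory=dict)

    def to_jsonl_line(self) -> str:
        # One compact JSON object per line, as JSONL requires.
        return json.dumps(asdict(self), ensure_ascii=False)


record = PreferenceRecord(
    prompt="Add error handling to the parse function",
    chosen="fn parse(input: &str) -> Result<Value, ParseError> { ... }",
    rejected="fn parse(input: &str) -> Value { input.parse().unwrap() }",
    metadata={"source": "undo"},  # assumed key: which correction event fired
)
line = record.to_jsonl_line()
print(line)
```

Keeping the three text fields at the top level and everything else under `metadata` means a trainer can ignore the extra context without a schema change.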

## Implementation Ideas

- Hook into the existing audit logger to capture tool call sequences
- Detect "correction events": user undoes an edit, re-runs with different instructions, or explicitly rejects output
- Store preference pairs in `~/.selfware/feedback/preferences.jsonl`
- Format: `{"prompt": "...", "chosen": "...", "rejected": "...", "metadata": {...}}`
- Integration with Unsloth for efficient local fine-tuning
- Privacy-first: all data stays local, user controls what gets logged
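
The detection and storage ideas above can be sketched as follows. Everything about the event schema is an assumption: the `kind` values (`prompt`, `agent_edit`, `undo`, `user_edit`), the `content` key, and the helpers `extract_pairs` and `append_pairs` are hypothetical, and the real audit log may look different. The sketch also assumes that whatever replaces an undone edit is the preferred output, which will not always hold.

```python
import json
import tempfile
from pathlib import Path


def extract_pairs(events):
    """Pair each corrected agent edit with the output that replaced it."""
    pairs = []
    prompt = None     # most recent user instruction
    last_edit = None  # most recent agent output that has not been undone
    rejected = None   # undone output waiting for a replacement

    def emit(chosen, bad, kind):
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": bad,
                      "metadata": {"chosen_by": kind}})

    for event in events:
        kind, content = event["kind"], event["content"]
        if kind == "prompt":
            # While a rejection is pending, keep the original task prompt so a
            # re-run with different instructions pairs with the first request.
            if rejected is None:
                prompt = content
        elif kind == "undo":
            if last_edit is not None:
                rejected, last_edit = last_edit, None
        elif kind == "user_edit":
            # A manual edit corrects whatever the agent last produced.
            bad = rejected if rejected is not None else last_edit
            if bad is not None and prompt is not None:
                emit(content, bad, kind)
            rejected = last_edit = None
        elif kind == "agent_edit":
            if rejected is not None and prompt is not None:
                emit(content, rejected, kind)
                rejected = None
            last_edit = content
    return pairs


def append_pairs(pairs, path, enabled=True):
    """Append pairs as JSONL. Writes nothing unless logging is enabled."""
    if not enabled or not pairs:
        return 0
    path = Path(path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
    return len(pairs)


events = [
    {"kind": "prompt", "content": "Add error handling to the parse function"},
    {"kind": "agent_edit", "content": "fn parse(input: &str) -> Value { input.parse().unwrap() }"},
    {"kind": "undo", "content": ""},
    {"kind": "prompt", "content": "Return a Result instead of unwrapping"},
    {"kind": "agent_edit", "content": "fn parse(input: &str) -> Result<Value, ParseError> { ... }"},
]
pairs = extract_pairs(events)

# The demo writes to a temporary directory rather than ~/.selfware.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "feedback" / "preferences.jsonl"
    skipped = append_pairs(pairs, target, enabled=False)  # user opted out
    written = append_pairs(pairs, target)                 # user opted in
    lines = target.read_text(encoding="utf-8").splitlines()
```

The `enabled` flag stands in for the user-controlled logging switch. Appending one line per pair keeps the file safe to write incrementally across sessions.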

## Example Output

```jsonl
{"prompt": "Add error handling to the parse function", "chosen": "fn parse(input: &str) -> Result<Value, ParseError> { ... }", "rejected": "fn parse(input: &str) -> Value { input.parse().unwrap() }", "metadata": {"task": "error_handling", "timestamp": "2026-03-09T12:00:00Z"}}
```
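
Before training, records like the one above would need to be read back and sanity-checked. The sketch below is one possible loader using only the standard library. `load_preferences` and `REQUIRED` are hypothetical names, and the validation rules (non-empty strings, `chosen` differing from `rejected`) are assumptions about what makes a pair usable. The returned dict of columns is the shape accepted by `datasets.Dataset.from_dict`, and it should match the prompt/chosen/rejected layout that TRL's DPO trainer documents, though that is worth verifying against the version in use.

```python
import json

REQUIRED = ("prompt", "chosen", "rejected")


def load_preferences(lines):
    """Parse JSONL lines into column lists, counting unusable records."""
    columns = {key: [] for key in REQUIRED}
    skipped = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # blank lines are not records
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            skipped += 1
            continue
        if not isinstance(record, dict):
            skipped += 1
            continue
        values = [record.get(key) for key in REQUIRED]
        # Require non-empty strings and a real preference between two outputs.
        usable = all(isinstance(v, str) and v for v in values)
        if not usable or record["chosen"] == record["rejected"]:
            skipped += 1
            continue
        for key, value in zip(REQUIRED, values):
            columns[key].append(value)
    return columns, skipped


sample = [
    '{"prompt": "p1", "chosen": "good", "rejected": "bad", "metadata": {}}',
    '{"prompt": "p2", "chosen": "same", "rejected": "same"}',
    "not json",
    "",
]
columns, skipped = load_preferences(sample)
```

Counting skipped records instead of raising keeps one malformed line from blocking a fine-tuning run, while still surfacing how much data was dropped.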

## Relevant Code

- `src/safety/audit.rs` — JSONL audit logging (similar pattern)
- `src/session/edit_history.rs` — undo/redo tracking
- `src/cognitive/episodic.rs` — learning from past sessions
- `src/self_healing/` — error learning
