Problem
The agent's success metric is currently val_bpb (validation bits-per-byte) alone. While a strong proxy, cross-entropy loss in isolation can mask models that have collapsed on specific linguistic capabilities or that game the metric through degenerate formatting.
Proposal
Integrate a lightweight zero-shot evaluation loop that runs at the end of the TIME_BUDGET. It could test against miniature subsets of MMLU, HellaSwag, or a simple QA format. Emitting an accuracy percentage in results.tsv alongside val_bpb would give the agent a secondary, qualitative anchor, ensuring its architectural changes improve actual model reasoning rather than just the loss.
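A minimal sketch of what such a loop could look like. The `score(prompt, completion)` callable is a hypothetical interface: it is assumed to return the model's total log-likelihood of `completion` given `prompt`; the multiple-choice example format and the results.tsv column layout are likewise assumptions, not existing code.

```python
def zero_shot_accuracy(examples, score):
    """Multiple-choice zero-shot eval.

    examples: list of (prompt, choices, correct_idx) tuples.
    score: hypothetical callable returning the model's log-likelihood
           of a completion given a prompt.
    """
    correct = 0
    for prompt, choices, answer_idx in examples:
        # Pick the choice the model assigns the highest log-likelihood.
        scores = [score(prompt, c) for c in choices]
        pred = max(range(len(choices)), key=scores.__getitem__)
        correct += int(pred == answer_idx)
    return correct / len(examples)


def append_result(path, step, val_bpb, accuracy):
    # Emit accuracy alongside val_bpb in results.tsv (tab-separated).
    with open(path, "a") as f:
        f.write(f"{step}\t{val_bpb:.4f}\t{accuracy:.4f}\n")
```

Scoring by log-likelihood of each answer option (rather than generating free-form text) keeps the loop cheap enough to run once at the end of the budget, even on a few hundred examples.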