A Continually Updating, Self-Auditing Benchmark for LLM Reasoning, Tool Use, and Learning Over Time
Current LLM benchmarks suffer from three fundamental problems: benchmark saturation (models are optimized for fixed test sets), single-score reductionism (complex capabilities collapsed to one number), and gaming susceptibility (models exploit surface patterns rather than demonstrating genuine understanding).
LivingBench addresses these challenges through:
- Dynamic Task Generation: Tasks are generated continuously from multiple sources (synthetic reasoning, code understanding, tool use scenarios), ensuring the benchmark never freezes.
- Skill-Factorized Evaluation: Instead of single accuracy scores, we produce capability fingerprints: multi-dimensional profiles that decompose performance across cognitive skills.
- Anti-Gaming Robustness Layer: Automatic generation of paraphrases, counterfactuals, and adversarial variants to detect memorization and spurious correlations.
- Learning-Over-Time Measurement: Multi-session evaluation tracking error correction, knowledge retention, and performance trends.
- Calibrated Multi-Judge Evaluation: An ensemble of judges with disagreement analysis and calibration checking, so no single evaluator is trusted blindly.
Modern LLMs are increasingly trained or fine-tuned with benchmark performance as a direct objective. This creates a fundamental tension: benchmarks designed to measure capability become targets for optimization, degrading their validity as capability measures (Goodhart's Law in action).
A model with 80% accuracy might excel at factual recall but fail at causal reasoning. Another 80%-accuracy model might show the opposite pattern. Single scores hide these critical differences that matter for deployment decisions.
Models can achieve high scores through:
- Memorizing benchmark-specific patterns
- Exploiting spurious correlations (answer position bias, keyword triggers)
- Producing overly verbose responses that contain the correct substring by chance
LivingBench is designed to detect and penalize these failure modes.
livingbench/
├── core/ # Type system and configuration
│   ├── types.py # Core data structures (Task, EvaluationResult, etc.)
│   ├── config.py # Hierarchical configuration management
│   └── registry.py # Component registration
├── tasks/ # Task generation engine
│   ├── generator.py # Main orchestration
│   ├── labeling.py # Skill and difficulty estimation
│   └── sources/ # Task sources
│       ├── synthetic.py # Synthetic reasoning/math tasks
│       ├── github.py # GitHub issues/PRs
│       ├── arxiv.py # ArXiv abstracts
│       └── tool_use.py # Tool use scenarios
├── evaluation/ # Evaluation pipeline
│   ├── pipeline.py # Main evaluation orchestration
│   ├── skill_decomposition.py # Skill-factorized scoring
│   ├── capability_fingerprint.py # Multi-dimensional profiles
│   └── metrics.py # Standardized metrics
├── robustness/ # Anti-gaming layer
│   ├── paraphrase.py # Paraphrase generation
│   ├── counterfactual.py # Counterfactual variants
│   ├── adversarial.py # Adversarial perturbations
│   └── detectors.py # Gaming/memorization detection
├── temporal/ # Learning-over-time evaluation
│   ├── session_tracker.py # Multi-session tracking
│   ├── learning_metrics.py # Learning curve analysis
│   └── degradation.py # Performance degradation detection
├── judges/ # LLM-as-Judge system
│   ├── base.py # Judge interfaces
│   ├── ensemble.py # Multi-judge aggregation
│   ├── calibration.py # Calibration analysis
│   └── disagreement.py # Disagreement patterns
└── utils/ # Utilities
    ├── logging.py # Structured logging
    └── reproducibility.py # Deterministic execution
Tasks are generated from multiple sources to ensure diversity and freshness:
| Source | Description | Skills Tested |
|---|---|---|
| Synthetic Reasoning | Syllogisms, constraint satisfaction, propositional logic | Logical deduction, multi-step planning |
| Synthetic Math | Word problems, algebra, probability, number theory | Mathematical reasoning, reading comprehension |
| GitHub Issues | Real bug reports and feature requests | Code understanding, causal inference |
| ArXiv Abstracts | Paper comprehension and analysis | Reading comprehension, knowledge integration |
| Tool Use Scenarios | Selection, composition, error recovery | Tool selection, error recovery |
Each task is labeled with:
- Required skills (from a taxonomy of 20+ cognitive capabilities)
- Difficulty level (trivial → adversarial)
- Content hash (for deduplication)
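The content hash used for deduplication could look roughly like the sketch below. This is an illustration only, not LivingBench's actual implementation: the function names `content_hash` and `deduplicate`, the choice of SHA-256, and the whitespace/case normalization are all assumptions.

```python
import hashlib
import re


def content_hash(prompt: str) -> str:
    """Return a SHA-256 hex digest of a whitespace- and case-normalized prompt.

    Hypothetical helper: the normalization rules here are an assumption.
    """
    normalized = re.sub(r"\s+", " ", prompt).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(prompts: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct normalized prompt."""
    seen: set[str] = set()
    unique: list[str] = []
    for prompt in prompts:
        digest = content_hash(prompt)
        if digest not in seen:
            seen.add(digest)
            unique.append(prompt)
    return unique
```

Hashing a normalized form means trivially different copies (extra spaces, different casing) collapse to one task, while genuinely different prompts remain distinct.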
Rather than a single accuracy score, we produce capability fingerprints:
fingerprint = CapabilityFingerprint(
    model_id="gpt-4-turbo",
    skill_scores={
        "logical_deduction": 0.85,
        "mathematical_reasoning": 0.78,
        "causal_inference": 0.72,
        "tool_selection": 0.91,
        "error_recovery": 0.65,
        # ... 15+ more skills
    },
    difficulty_scores={
        "easy": 0.95,
        "medium": 0.82,
        "hard": 0.61,
        "very_hard": 0.38,
    },
    paraphrase_consistency=0.89,
    adversarial_robustness=0.76,
    calibration_error=0.08,
)

This enables:
- Fine-grained model comparison
- Targeted capability improvement
- Deployment-specific model selection
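As a rough illustration of fine-grained comparison and deployment-specific selection, the sketch below works on plain per-skill score dictionaries. The helpers `skill_deltas` and `best_model_for`, the model names, and the "highest worst-case skill" selection rule are assumptions made for this example, not part of the LivingBench API.

```python
def skill_deltas(scores_a: dict[str, float], scores_b: dict[str, float]) -> dict[str, float]:
    """Per-skill score difference (A minus B) over the skills both models share."""
    shared = scores_a.keys() & scores_b.keys()
    return {skill: round(scores_a[skill] - scores_b[skill], 4) for skill in sorted(shared)}


def best_model_for(required_skills: list[str], candidates: dict[str, dict[str, float]]) -> str:
    """Pick the model whose weakest required skill is strongest.

    Missing skills count as 0.0 (an assumption of this sketch).
    """
    return max(
        candidates,
        key=lambda model: min(candidates[model].get(s, 0.0) for s in required_skills),
    )


# Two hypothetical models with the same mean score but opposite strengths.
candidates = {
    "model_a": {"logical_deduction": 0.85, "error_recovery": 0.65},
    "model_b": {"logical_deduction": 0.70, "error_recovery": 0.80},
}
deltas = skill_deltas(candidates["model_a"], candidates["model_b"])
choice = best_model_for(["error_recovery"], candidates)
```

Both hypothetical models average 0.75, yet an application that depends on error recovery would prefer `model_b`, which is exactly the distinction a single score hides.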
For each task, we automatically generate:
| Variant Type | Purpose | Expected Behavior |
|---|---|---|
| Paraphrases | Same meaning, different wording | Answer should be consistent |
| Counterfactuals | Changed premises | Answer should change appropriately |
| Adversarial | Perturbations and distractors | Answer should remain correct |
Detected gaming patterns:
- Memorization (high accuracy on originals, low on paraphrases)
- Spurious correlation (answer position bias, keyword triggers)
- Verbosity gaming (excessive length without correctness)
- Confidence gaming (uniformly high confidence regardless of difficulty)
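The memorization signal above (high accuracy on originals, low on paraphrases) can be sketched as a simple accuracy gap. The function names and the 0.2 threshold are assumptions for illustration; the actual detectors in `robustness/detectors.py` may work differently.

```python
def accuracy(outcomes: list[bool]) -> float:
    """Fraction of correct outcomes; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def memorization_gap(original_correct: list[bool], paraphrase_correct: list[bool]) -> float:
    """Accuracy on original tasks minus accuracy on their paraphrases."""
    return accuracy(original_correct) - accuracy(paraphrase_correct)


def flags_memorization(
    original_correct: list[bool],
    paraphrase_correct: list[bool],
    threshold: float = 0.2,  # hypothetical cutoff, not a LivingBench default
) -> bool:
    """Flag a model whose accuracy drops by more than `threshold` on paraphrases."""
    return memorization_gap(original_correct, paraphrase_correct) > threshold
```

A model that understands the task should score about the same on both sets, so a large positive gap suggests it is recalling surface forms instead.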
No single judge is trusted. We use:
ensemble = EnsembleJudge(
    judges=[
        ExactMatchJudge(),
        RubricJudge(),
        LLMJudge("gpt-4"),
        LLMJudge("claude-3"),
    ],
    aggregation="weighted",  # by calibrated confidence
)

Analysis includes:
- Pairwise agreement rates between judges
- Calibration metrics (ECE, overconfidence rate)
- Disagreement patterns (which task types cause disagreement?)
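Two of these analyses can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: judge verdicts are given as lists of booleans, ECE uses ten equal-width confidence bins, and the function names are hypothetical rather than taken from `judges/calibration.py` or `judges/disagreement.py`.

```python
from itertools import combinations


def pairwise_agreement(verdicts: dict[str, list[bool]]) -> dict[tuple[str, str], float]:
    """Fraction of tasks on which each pair of judges gives the same verdict."""
    rates: dict[tuple[str, str], float] = {}
    for a, b in combinations(sorted(verdicts), 2):
        pairs = list(zip(verdicts[a], verdicts[b]))
        if pairs:  # skip judge pairs with no shared tasks
            rates[(a, b)] = sum(x == y for x, y in pairs) / len(pairs)
    return rates


def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """ECE: size-weighted mean of |average confidence - accuracy| per bin."""
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(avg_conf - acc)
    return ece
```

Low pairwise agreement on a task type points to ambiguous tasks or an unreliable judge, while a high ECE means stated confidence does not track actual correctness.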
# Clone the repository
git clone https://github.com/Ayushhgit/LivingBench.git
cd LivingBench

# Install with pip (Python 3.10+)
pip install -e ".[all]"

# Or with specific providers
pip install -e ".[openai,anthropic]"

# Run Groq experiment (set GROQ_API_KEY in .env first)
python scripts/run_groq_experiment.py --n-tasks 50

# View results
cat outputs/groq_experiment/results.json

from livingbench import LivingBenchConfig
from livingbench.tasks import TaskGenerationEngine
from livingbench.evaluation import EvaluationPipeline, SimplePipeline
from livingbench.evaluation.capability_fingerprint import FingerprintComputer

# Generate tasks
engine = TaskGenerationEngine(seed=42)
tasks = engine.generate_balanced(n_tasks=100)

# Evaluate with your model
def my_model(prompt: str) -> str:
    # Your model inference here
    return "model response"

def my_judge(task, response, reference):
    # Your judging logic
    correct = reference.lower() in response.lower()
    return correct, 1.0 if correct else 0.0, "Matched" if correct else "No match"

pipeline = SimplePipeline()
results = pipeline.evaluate_with_model(
    tasks=tasks,
    model_fn=my_model,
    judge_fn=my_judge,
    model_id="my_model",
)

# Compute capability fingerprint
computer = FingerprintComputer()
fingerprint = computer.compute("my_model", results)
print(f"Overall accuracy: {sum(r.is_correct for r in results) / len(results):.1%}")
print(f"Skill scores: {fingerprint.skill_scores}")

LivingBench uses hierarchical YAML configuration:
# configs/experiments/default.yaml
experiment_name: "baseline_comparison"
random_seed: 42

task_generation:
  synthetic_reasoning_enabled: true
  synthetic_math_enabled: true
  tool_use_enabled: true
  synthetic_tasks_count: 200

evaluation:
  n_judges: 3
  judge_models:
    - "gpt-4-turbo"
    - "claude-3-sonnet"
  compute_skill_scores: true
  compute_calibration: true

robustness:
  paraphrase_enabled: true
  n_paraphrases: 3
  adversarial_enabled: true

temporal:
  enabled: false # Enable for multi-session evaluation

Use cases:
- Identify capability gaps to target during training
- Track progress across model versions
- Compare fine-tuning strategies
- Detect memorization and gaming
- Measure robustness to adversarial inputs
- Identify overconfidence patterns
- Match model capabilities to application requirements
- Quantify reliability and consistency
- Monitor for capability degradation
Known limitations:
- Synthetic Task Bias: Generated tasks may not fully represent real-world distributions
- Judge Reliability: LLM judges inherit their own biases and failure modes
- Skill Taxonomy: Our skill categories are approximations of latent cognitive capabilities
- Temporal Evaluation: Currently limited to simulated multi-session scenarios
Future work:
- Human Calibration: Collect human judgments to calibrate LLM judges
- Dynamic Difficulty: Adaptive testing based on demonstrated capability
- Cross-Lingual Extension: Evaluate multilingual capabilities
- Agentic Evaluation: Long-horizon tool use and planning tasks
- Benchmark-of-Benchmarks: Meta-evaluation against other benchmark suites
MIT License. See LICENSE for details.