psychlingbench is a lightweight, extensible benchmarking toolkit designed to evaluate how well language models align with human word-by-word processing behavior, particularly during natural reading.
Rather than scoring language models by perplexity, BLEU, or task accuracy, psychlingbench evaluates them directly against behavioral measures of human time-locked processing:
- First-pass fixation durations (eye-tracking)
- Reaction times
- EEG/ERP responses (future versions)
The benchmark scores models with a noise-normalized Pearson R² between model predictions and human data (the squared correlation, scaled by an estimate of the explainable variance given measurement noise), making scores directly comparable across datasets and measurement types.
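As an illustration of the metric, the sketch below computes a squared Pearson correlation and divides it by a noise ceiling. The ceiling value and the function name are placeholders for this example, not the toolkit's actual API or estimation procedure:

```python
import numpy as np

def noise_normalized_r2(predictions, human_means, noise_ceiling_r2):
    """Squared Pearson correlation, normalized by a noise ceiling.

    `noise_ceiling_r2` is the maximum explainable variance given
    measurement noise (e.g., estimated from split-half reliability).
    """
    r = np.corrcoef(predictions, human_means)[0, 1]
    return (r ** 2) / noise_ceiling_r2

# Hypothetical per-word first-pass fixation durations (ms)
preds = np.array([210.0, 250.0, 190.0, 300.0, 220.0])
human = np.array([205.0, 260.0, 200.0, 310.0, 215.0])
score = noise_normalized_r2(preds, human, noise_ceiling_r2=0.8)
```

Because the score is normalized by the ceiling rather than by 1.0, a model can in principle exceed 1 if it fits the sample data better than the noise estimate allows.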
```bash
# Basic installation
pip install psychlingbench

# With baseline model support (requires PyTorch)
pip install "psychlingbench[baselines]"

# Development installation
pip install -e ".[dev,baselines]"
```

```python
from psychlingbench.benchmarks.fix10k import Fix10kBenchmark
from psychlingbench.baselines.gpt2_surprisal import predict_fix_ms

# Initialize the benchmark
benchmark = Fix10kBenchmark()

# Evaluate a model
results = benchmark.evaluate(
    model=None,  # the model is encapsulated in the predict function
    predict_fixation_function=predict_fix_ms,
    model_name="gpt2_surprisal",
)

print(f"Score: {results['metrics']['noise_normalized']:.3f}")
```

```bash
# Download benchmark data
psychlingbench download fix10k

# Run evaluation with the GPT-2 surprisal baseline
psychlingbench eval gpt2_surprisal fix10k

# Launch visualization dashboard
psychlingbench view
```

- fix10k: First-pass fixation durations from eye-tracking data (v0.1)
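Custom models plug in through the predict function. Its exact signature isn't documented above, so the sketch below assumes it maps a sequence of words to per-word predicted first-pass fixation durations in milliseconds; the word-frequency baseline and its names are hypothetical, not part of the package:

```python
# Hypothetical log-frequency table; a real baseline would use corpus counts.
WORD_FREQ = {"the": 6.0, "cat": 3.5, "sat": 3.0, "mat": 2.8}

def predict_fix_ms_freq(words):
    """Toy baseline: rarer words get longer predicted fixations."""
    default_logfreq = 2.0  # assumed floor for out-of-vocabulary words
    return [
        180.0 + 25.0 * (7.0 - WORD_FREQ.get(w.lower(), default_logfreq))
        for w in words
    ]

preds = predict_fix_ms_freq(["The", "cat", "sat"])
```

Under that assumed interface, such a function would be passed as `predict_fixation_function=predict_fix_ms_freq` to `benchmark.evaluate`.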
| Version | Adds | Purpose |
|---|---|---|
| v0.1 | fix10k (first-pass fixations) | Core reading behavior test |
| v0.2 | RT benchmark (e.g., ELP) | Tests reaction time prediction |
| v0.3 | EEG/ERP signals (e.g., N400) | Adds neural-behavioral links |
| v0.4 | Energy probe support | Enables fit-vs-cost analysis |
| v1.0 | Composite PsychScore | Unified score across domains |
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License; see the LICENSE file for details.