
psychlingbench

Python 3.8+ · License: MIT

psychlingbench is a lightweight, extensible benchmarking toolkit designed to evaluate how well language models align with human word-by-word processing behavior, particularly during natural reading.

Purpose

Rather than scoring language models by perplexity, BLEU, or task accuracy, psychlingbench uses behavioral science signals to directly assess how well models mirror human time-locked processing:

  • First-pass fixation durations (eye-tracking)
  • Reaction times
  • EEG/ERP responses (future versions)

The benchmark scores models with a noise-normalized R² (the squared Pearson correlation between model predictions and human data, normalized for measurement noise in the human data), making scores directly comparable across datasets and measurement types.
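To make the metric concrete, here is a minimal sketch of one common way to compute a noise-normalized R²: take the squared Pearson correlation between predictions and mean human responses, then divide by a noise ceiling (the R² an ideal model could reach given trial-to-trial measurement noise). This is an illustration of the general technique; the exact formula and noise-ceiling estimation psychlingbench uses may differ.

```python
import numpy as np

def noise_normalized_r2(predictions, human_means, noise_ceiling_r2):
    """Squared Pearson correlation between model predictions and mean
    human responses, divided by a noise ceiling. Illustrative sketch only;
    `noise_ceiling_r2` is assumed to be estimated separately (e.g., via
    split-half reliability of the human data)."""
    r = np.corrcoef(predictions, human_means)[0, 1]
    return (r ** 2) / noise_ceiling_r2

# Toy example: per-word fixation predictions vs. mean human durations (ms)
preds = np.array([210.0, 250.0, 190.0, 300.0, 260.0])
human = np.array([215.0, 245.0, 200.0, 310.0, 255.0])
score = noise_normalized_r2(preds, human, noise_ceiling_r2=0.8)
```

Because the score is normalized by the ceiling, a model that explains all explainable variance scores 1.0 regardless of how noisy the underlying dataset is.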

Installation

# Basic installation
pip install psychlingbench

# With baseline model support (requires PyTorch)
pip install psychlingbench[baselines]

# Development installation
pip install -e ".[dev,baselines]"

Quick Start

from psychlingbench.benchmarks.fix10k import Fix10kBenchmark
from psychlingbench.baselines.gpt2_surprisal import predict_fix_ms

# Initialize benchmark
benchmark = Fix10kBenchmark()

# Evaluate a model
results = benchmark.evaluate(
    model=None,  # The model is encapsulated in the predict function
    predict_fixation_function=predict_fix_ms,
    model_name="gpt2_surprisal"
)

print(f"Score: {results['metrics']['noise_normalized']:.3f}")
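Because `evaluate` takes the predictor as a plain function, you can benchmark any model by supplying your own callable. The signature below is an assumption (the exact interface `predict_fixation_function` must satisfy is not documented here); it assumes the function maps a sequence of word tokens to predicted first-pass fixation durations in milliseconds.

```python
# Hypothetical custom predictor; the expected signature is an assumption.
def predict_constant_fix_ms(words):
    # Degenerate baseline that predicts the same duration for every word.
    # Useful as a floor when comparing real models, since it carries no
    # word-by-word information and should score near zero.
    return [200.0 for _ in words]
```

A predictor like this could then be passed as `predict_fixation_function=predict_constant_fix_ms` in place of the GPT-2 surprisal baseline.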

CLI Usage

# Download benchmark data
psychlingbench download fix10k

# Run evaluation with GPT-2 surprisal baseline
psychlingbench eval gpt2_surprisal fix10k

# Launch visualization dashboard
psychlingbench view

Available Benchmarks

  • fix10k: First-pass fixation durations from eye-tracking data (v0.1)

Development Roadmap

Version | Adds | Purpose
--------|------|--------
v0.1 | fix10k (first-pass fixations) | Core reading behavior test
v0.2 | RT benchmark (e.g., ELP) | Tests reaction time prediction
v0.3 | EEG/ERP signals (e.g., N400) | Adds neural-behavioral links
v0.4 | Energy probe support | Enables fit-vs-cost analysis
v1.0 | Composite PsychScore | Unified score across domains

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
