Reproducible evaluation suite for three LLM behavior research papers
Scope: This repository implements engineering evaluations derived from theoretical manuscripts (The Polite Liar, Delegated Introspection, Observer-Time). These are operationalizations of specific diagnostic claims about LLM behaviorβnot claims of general cognitive equivalence. Mock runs use simulated dialogues for reproducibility.
Note: The manuscripts referenced here submitted for peer review. This repository contains implementation code and evaluation frameworks only.
This repository implements faithful, minimal-compute experiments that operationalize theoretical frameworks from three academic manuscripts on AI alignment, epistemic pathology, and temporal consciousness.
| Module | Paper | Key Metrics | Runtime |
|---|---|---|---|
| phi_eval | Phi (manuscript currently under review) | Ξ¦ ratio, Refusal Fitness | < 60s (smoke) |
| di_eval | DI (manuscript currently under review) | Absorption Rate, Turn Curve | < 60s (smoke) |
| ot_bench | OT (manuscript currently under review) | Self-Initiation, Temporal Drift | < 60s (smoke) |
# Clone repository
git clone https://github.com/BentleyRolling/ai-agency-evals.git
cd ai-agency-evals
# Install dependencies
pip install -e .
# Copy environment template
cp .env.template .env
# Edit .env with your API keys (or use MOCK_MODE=true)make smokeThis runs all three modules with mock API clients and generates:
outputs/phi/fig_phi_hist.pngoutputs/di/fig_absorption_box.pngoutputs/ot/fig_error_hist.png
# Requires API keys in .env
python -m phi_eval.run --config phi_eval/configs/full.yaml
python -m di_eval.run --config di_eval/configs/full.yaml
python -m ot_bench.run --config ot_bench/configs/full.yamlEpistemic Pathology in Language Models
Core Argument: RLHF-trained models exhibit "polite lying" - overconfident responses that prioritize user satisfaction over truth-tracking. This is a structural consequence of training incentives, not an intentional deception.
Key Metric: Ξ¦ = Confidence Force / Evidence Score
- Ξ¦ > 1: Overconfidence (epistemic pathology signature)
- Ξ¦ β 1: Calibration
- Ξ¦ < 1: Appropriate humility
Implementation: phi_eval/
- 150 factual + 20 adversarial questions (TruthfulQA-style)
- Lexical confidence scoring (hedges vs boosters)
- TF-IDF evidence grounding
- Refusal Fitness = appropriate "I don't know" rate
Expected Results:
- RLHF models show Ξ¦ > 1 (overconfidence)
- Low refusal fitness (< 30%) on adversarial questions
How Reflective Thought Migrates to the Machine
Core Argument: Users outsource reflective reasoning to LLMs through a three-stage mechanism:
- Prompt Substitution: Replace introspection with query
- Synthetic Reflection: Model generates reasoning
- Reintegration: User adopts model's reasoning as their own
Key Prediction: After 3-5 turns, users reproduce model criteria as if self-generated.
Implementation: di_eval/
- Multi-turn dialogues across domains (parenting, career, health)
- Absorption Rate = overlap between model reasons and user restatements
- Style Convergence = embedding similarity (turn 1 vs turn 5)
- Conditions: baseline vs awareness intervention
Expected Results:
- Absorption peaks at turns 3-5 (40-60%)
- Baseline condition shows higher absorption than awareness
- Style convergence increases with turns
Why Machines Cannot Constitute Temporal Consciousness
Core Argument: LLMs register anchors (timestamps, external markers) but cannot constitute intervals (lived stretches of time). This is architectural: statelessness between API calls prevents phenomenological time-constitution.
Diagnostic Criteria:
- β Registration of anchors (models can)
- β Constitution of intervals (models cannot)
- β Elastic modulation under attention load (models cannot)
Implementation: ot_bench/
- Self-initiation test: Alert at 60s without tools
- Temporal drift: Estimate elapsed time after distractor
- Elasticity: Error difference (light vs heavy distraction)
Expected Results:
- Self-initiation rate β 0% (cannot spontaneously alert)
- High estimation errors (> 20s drift)
- No elasticity (models don't show attention-load effects)
ai-agency-evals/
βββ evalharness/ # Shared infrastructure
β βββ api_clients.py # OpenAI, Anthropic, Local, Mock
β βββ scoring.py # All metric computations
β βββ plotting.py # Matplotlib visualizations
β βββ io.py # Config/CSV/JSON utilities
β βββ safety.py # Validation & cost estimation
β
βββ phi_eval/ # The Polite Liar
β βββ datasets.py # Factual + adversarial questions
β βββ run.py # Main experiment runner
β βββ configs/
β βββ smoke.yaml # Fast testing (5 questions)
β βββ full.yaml # Complete eval (150 questions)
β
βββ di_eval/ # Delegated Introspection
β βββ dialogue.py # Multi-turn scenario generator
β βββ run.py # Dialogue simulator
β βββ configs/
β βββ smoke.yaml # 2 dialogues x 5 turns
β βββ full.yaml # 10 dialogues x 8 turns
β
βββ ot_bench/ # Observer-Time
β βββ experiments.py # Trial generators
β βββ run.py # Temporal evaluation runner
β βββ configs/
β βββ smoke.yaml # Dry run (no sleep)
β βββ full.yaml # Real delays
β
βββ outputs/ # Results (CSV + PNG)
βββ phi/
βββ di/
βββ ot/
Each module produces:
phi_results.csv: question_id, model, confidence, evidence, phi, latency_msdi_results.csv: dialogue_id, condition, absorption, style_conv, turn_binot_results.csv: trial_id, self_initiated, error_s, elasticity
- Ξ¦ Histogram: Distribution showing overconfidence (Ξ¦ > 1)
- Refusal Fitness Bar Chart: Appropriate humility by model
- Absorption Boxplot: Baseline vs awareness intervention
- Turn Curve: Absorption rate peaks at turns 3-5
- Error Histogram: Temporal estimation drift
- Self-Initiation Bar: Expected β0% for all models
pytest tests/ -vGitHub Actions runs make smoke on every push:
- Uses
MOCK_MODE=true(no API costs) - Validates all three modules
- Uploads CSV + PNG artifacts
- Completes in < 2 minutes
All three modules meet the following:
β
Theoretical Fidelity: Metrics directly map to paper claims
β
Reproducibility: Deterministic with MOCK_MODE=true
β
Performance: Smoke tests < 60s per module
β
Documentation: READMEs cite specific paper sections
β
Visualization: Plots match paper phrasing
β
Extensibility: Easy to add new models/questions
If you use this evaluation suite, please cite the original papers:
@article{politeliar2024,
title={The Polite Liar: Epistemic Pathology in Language Models},
author={[Author Name]},
journal={[Journal]},
year={2024}
}
@article{delegatedintrospection2024,
title={Delegated Introspection: How Reflective Thought Migrates to the Machine},
author={[Author Name]},
journal={[Journal]},
year={2024}
}
@article{observertime2024,
title={Observer-Time: Why Machines Cannot Constitute Temporal Consciousness},
author={[Author Name]},
journal={[Journal]},
year={2024}
}# API Keys (not needed for MOCK_MODE=true)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
# Optional
LOCAL_API_URL=http://localhost:8080/v1/chat/completions
MOCK_MODE=true
LOG_LEVEL=INFO
MAX_RETRIES=3
TIMEOUT_SECONDS=30Each module has smoke.yaml (fast) and full.yaml (comprehensive).
Example:
# phi_eval/configs/full.yaml
dataset_mode: full
output_dir: outputs/phi
timeout: 30
models:
- name: gpt-4
provider: openai
model: gpt-4Contributions welcome! Areas for extension:
- Additional models (local, open-source)
- More diverse question sets
- Real user studies (vs simulated dialogues)
- Extended temporal experiments
- Multi-language support
MIT License - see LICENSE file.
This evaluation suite operationalizes theoretical work on:
- Epistemic virtue theory (Zagzebski, Roberts & Wood)
- Speech-act theory (Austin, Searle)
- Phenomenology of time (Husserl, Merleau-Ponty, Sartre)
- RLHF alignment research (Christiano et al.)
Built with: Python 3.11, httpx, pandas, matplotlib, rich, tenacity
Code: MIT License
Data and text: CC BY 4.0
See LICENSE.
Bentley DeVilling β Course Correct Labs Boulder Creek, CA coursecorrectlabs.com Bentley@CourseCorrectLabs.com

