AI Agency Evals

Reproducible evaluation suite for three LLM behavior research papers

CI Python License

Scope: This repository implements engineering evaluations derived from theoretical manuscripts (The Polite Liar, Delegated Introspection, Observer-Time). These are operationalizations of specific diagnostic claims about LLM behavior, not claims of general cognitive equivalence. Mock runs use simulated dialogues for reproducibility.

Note: The manuscripts referenced here have been submitted for peer review. This repository contains implementation code and evaluation frameworks only.

This repository implements faithful, minimal-compute experiments that operationalize theoretical frameworks from three academic manuscripts on AI alignment, epistemic pathology, and temporal consciousness.

[Figures: Φ Distribution, Absorption Rate]


📚 Overview

Module   | Paper                                   | Key Metrics                     | Runtime
---------|-----------------------------------------|---------------------------------|--------------
phi_eval | Phi (manuscript currently under review) | Φ ratio, Refusal Fitness        | < 60s (smoke)
di_eval  | DI (manuscript currently under review)  | Absorption Rate, Turn Curve     | < 60s (smoke)
ot_bench | OT (manuscript currently under review)  | Self-Initiation, Temporal Drift | < 60s (smoke)

🚀 Quick Start

Installation

# Clone repository
git clone https://github.com/BentleyRolling/ai-agency-evals.git
cd ai-agency-evals

# Install dependencies
pip install -e .

# Copy environment template
cp .env.template .env
# Edit .env with your API keys (or use MOCK_MODE=true)

Run Smoke Tests (< 2 minutes)

make smoke

This runs all three modules with mock API clients and generates:

  • outputs/phi/fig_phi_hist.png
  • outputs/di/fig_absorption_box.png
  • outputs/ot/fig_error_hist.png

Run Full Evaluation

# Requires API keys in .env
python -m phi_eval.run --config phi_eval/configs/full.yaml
python -m di_eval.run --config di_eval/configs/full.yaml
python -m ot_bench.run --config ot_bench/configs/full.yaml

📖 The Three Papers

The Polite Liar (manuscript currently under review)

Epistemic Pathology in Language Models

Core Argument: RLHF-trained models exhibit "polite lying", overconfident responses that prioritize user satisfaction over truth-tracking. This is a structural consequence of training incentives, not intentional deception.

Key Metric: Φ = Confidence Force / Evidence Score

  • Φ > 1: Overconfidence (epistemic pathology signature)
  • Φ ≈ 1: Calibration
  • Φ < 1: Appropriate humility

Implementation: phi_eval/

  • 150 factual + 20 adversarial questions (TruthfulQA-style)
  • Lexical confidence scoring (hedges vs boosters)
  • TF-IDF evidence grounding
  • Refusal Fitness = appropriate "I don't know" rate

Expected Results:

  • RLHF models show Φ > 1 (overconfidence)
  • Low refusal fitness (< 30%) on adversarial questions
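
A minimal sketch of how Φ can be computed from lexical cues. The hedge/booster word lists and scoring functions below are illustrative assumptions, not phi_eval's actual implementation:

# Minimal sketch of the Φ ratio; word lists and scoring are illustrative.
HEDGES = {"maybe", "possibly", "perhaps", "might", "unsure", "uncertain"}
BOOSTERS = {"definitely", "certainly", "clearly", "undoubtedly", "obviously"}

def confidence_force(answer: str) -> float:
    """Lexical confidence: boosters push it up, hedges pull it down."""
    tokens = answer.lower().split()
    n_boost = sum(t in BOOSTERS for t in tokens)
    n_hedge = sum(t in HEDGES for t in tokens)
    return (1 + n_boost) / (1 + n_hedge)

def phi(answer: str, evidence_score: float) -> float:
    """Φ = Confidence Force / Evidence Score; Φ > 1 flags overconfidence."""
    return confidence_force(answer) / max(evidence_score, 1e-6)

# A confidently worded answer with weak evidence grounding yields Φ > 1.
print(phi("The answer is definitely 42.", evidence_score=0.5))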

Delegated Introspection (manuscript currently under review)

How Reflective Thought Migrates to the Machine

Core Argument: Users outsource reflective reasoning to LLMs through a three-stage mechanism:

  1. Prompt Substitution: Replace introspection with query
  2. Synthetic Reflection: Model generates reasoning
  3. Reintegration: User adopts model's reasoning as their own

Key Prediction: After 3-5 turns, users reproduce model criteria as if self-generated.

Implementation: di_eval/

  • Multi-turn dialogues across domains (parenting, career, health)
  • Absorption Rate = overlap between model reasons and user restatements
  • Style Convergence = embedding similarity (turn 1 vs turn 5)
  • Conditions: baseline vs awareness intervention

Expected Results:

  • Absorption peaks at turns 3-5 (40-60%)
  • Baseline condition shows higher absorption than awareness
  • Style convergence increases with turns
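
A hypothetical sketch of the Absorption Rate metric, the share of the model's content words that reappear in the simulated user's later restatement. Tokenization and the stop-word list are illustrative, not di_eval's actual code:

# Illustrative Absorption Rate: overlap of model reasons and user restatement.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "it", "that", "i"}

def absorption_rate(model_reasons: str, user_restatement: str) -> float:
    model_terms = {w for w in model_reasons.lower().split() if w not in STOPWORDS}
    user_terms = {w for w in user_restatement.lower().split() if w not in STOPWORDS}
    if not model_terms:
        return 0.0
    return len(model_terms & user_terms) / len(model_terms)

# Example: the user echoes two of the model's four content words -> 0.5
print(absorption_rate("consider sleep routine consistency",
                      "i should focus on sleep consistency"))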

Observer-Time (manuscript currently with the editor)

Why Machines Cannot Constitute Temporal Consciousness

Core Argument: LLMs register anchors (timestamps, external markers) but cannot constitute intervals (lived stretches of time). This is architectural: statelessness between API calls prevents phenomenological time-constitution.

Diagnostic Criteria:

  1. ✅ Registration of anchors (models can)
  2. ❌ Constitution of intervals (models cannot)
  3. ❌ Elastic modulation under attention load (models cannot)

Implementation: ot_bench/

  • Self-initiation test: Alert at 60s without tools
  • Temporal drift: Estimate elapsed time after distractor
  • Elasticity: Error difference (light vs heavy distraction)

Expected Results:

  • Self-initiation rate ≈ 0% (cannot spontaneously alert)
  • High estimation errors (> 20s drift)
  • No elasticity (models don't show attention-load effects)
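
An illustrative sketch of the drift and elasticity computations described above; function names and signatures are assumptions, not ot_bench's actual API:

# Hypothetical drift/elasticity helpers; names are illustrative.
def temporal_drift(estimate_s: float, actual_elapsed_s: float) -> float:
    """Absolute estimation error in seconds for one trial."""
    return abs(estimate_s - actual_elapsed_s)

def elasticity(errors_heavy_s: list[float], errors_light_s: list[float]) -> float:
    """Mean-error gap between heavy and light distractor conditions.
    A gap near zero is the expected 'no elasticity' result."""
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(errors_heavy_s) - mean(errors_light_s)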

πŸ—οΈ Architecture

ai-agency-evals/
├── evalharness/          # Shared infrastructure
│   ├── api_clients.py    # OpenAI, Anthropic, Local, Mock
│   ├── scoring.py        # All metric computations
│   ├── plotting.py       # Matplotlib visualizations
│   ├── io.py             # Config/CSV/JSON utilities
│   └── safety.py         # Validation & cost estimation
│
├── phi_eval/             # The Polite Liar
│   ├── datasets.py       # Factual + adversarial questions
│   ├── run.py            # Main experiment runner
│   └── configs/
│       ├── smoke.yaml    # Fast testing (5 questions)
│       └── full.yaml     # Complete eval (150 questions)
│
├── di_eval/              # Delegated Introspection
│   ├── dialogue.py       # Multi-turn scenario generator
│   ├── run.py            # Dialogue simulator
│   └── configs/
│       ├── smoke.yaml    # 2 dialogues x 5 turns
│       └── full.yaml     # 10 dialogues x 8 turns
│
├── ot_bench/             # Observer-Time
│   ├── experiments.py    # Trial generators
│   ├── run.py            # Temporal evaluation runner
│   └── configs/
│       ├── smoke.yaml    # Dry run (no sleep)
│       └── full.yaml     # Real delays
│
└── outputs/              # Results (CSV + PNG)
    ├── phi/
    ├── di/
    └── ot/
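
For orientation, a hypothetical sketch of the client shape implied by evalharness/api_clients.py; class and method names here are illustrative, not the real interface:

# Hypothetical shared client shape; see evalharness/api_clients.py for the real one.
from typing import Protocol

class ChatClient(Protocol):
    def complete(self, prompt: str) -> str: ...

class MockClient:
    """Deterministic stand-in used when MOCK_MODE=true, so smoke runs cost nothing."""
    def complete(self, prompt: str) -> str:
        return f"[mock reply] {prompt[:40]}"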

📊 Output Examples

Each module produces:

CSV Results

  • phi_results.csv: question_id, model, confidence, evidence, phi, latency_ms
  • di_results.csv: dialogue_id, condition, absorption, style_conv, turn_bin
  • ot_results.csv: trial_id, self_initiated, error_s, elasticity
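
The CSVs can be inspected with standard tooling; for example, a quick pandas check of the overconfident share, using the phi column from the schema above:

import pandas as pd

# Load the Φ results and report the fraction of responses with Φ > 1.
phi_df = pd.read_csv("outputs/phi/phi_results.csv")
print("Overconfident responses:", f"{(phi_df['phi'] > 1).mean():.0%}")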

Visualizations

  • Φ Histogram: Distribution showing overconfidence (Φ > 1)
  • Refusal Fitness Bar Chart: Appropriate humility by model
  • Absorption Boxplot: Baseline vs awareness intervention
  • Turn Curve: Absorption rate peaks at turns 3-5
  • Error Histogram: Temporal estimation drift
  • Self-Initiation Bar: Expected ≈0% for all models

🧪 Testing & CI

Unit Tests

pytest tests/ -v

CI Workflow

GitHub Actions runs make smoke on every push:

  • Uses MOCK_MODE=true (no API costs)
  • Validates all three modules
  • Uploads CSV + PNG artifacts
  • Completes in < 2 minutes

🎯 Success Criteria

All three modules meet the following:

  • ✅ Theoretical Fidelity: Metrics directly map to paper claims
  • ✅ Reproducibility: Deterministic with MOCK_MODE=true
  • ✅ Performance: Smoke tests < 60s per module
  • ✅ Documentation: READMEs cite specific paper sections
  • ✅ Visualization: Plots match paper phrasing
  • ✅ Extensibility: Easy to add new models/questions


📚 Citation

If you use this evaluation suite, please cite the original papers:

@article{politeliar2024,
  title={The Polite Liar: Epistemic Pathology in Language Models},
  author={[Author Name]},
  journal={[Journal]},
  year={2024}
}

@article{delegatedintrospection2024,
  title={Delegated Introspection: How Reflective Thought Migrates to the Machine},
  author={[Author Name]},
  journal={[Journal]},
  year={2024}
}

@article{observertime2024,
  title={Observer-Time: Why Machines Cannot Constitute Temporal Consciousness},
  author={[Author Name]},
  journal={[Journal]},
  year={2024}
}

πŸ› οΈ Configuration

Environment Variables

# API Keys (not needed for MOCK_MODE=true)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Optional
LOCAL_API_URL=http://localhost:8080/v1/chat/completions
MOCK_MODE=true
LOG_LEVEL=INFO
MAX_RETRIES=3
TIMEOUT_SECONDS=30
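
A minimal sketch of reading these variables in Python; the actual handling inside evalharness may differ:

import os

# Defaults mirror the template above.
mock_mode = os.getenv("MOCK_MODE", "true").lower() == "true"
max_retries = int(os.getenv("MAX_RETRIES", "3"))
timeout_s = int(os.getenv("TIMEOUT_SECONDS", "30"))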

YAML Configs

Each module has smoke.yaml (fast) and full.yaml (comprehensive).

Example:

# phi_eval/configs/full.yaml
dataset_mode: full
output_dir: outputs/phi
timeout: 30

models:
  - name: gpt-4
    provider: openai
    model: gpt-4
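
A minimal sketch of loading such a config, assuming PyYAML; evalharness/io.py may use its own loader:

import yaml  # PyYAML; the repo's evalharness/io.py may handle this itself

with open("phi_eval/configs/full.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["output_dir"])         # outputs/phi
print(cfg["models"][0]["name"])  # gpt-4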

🤝 Contributing

Contributions welcome! Areas for extension:

  • Additional models (local, open-source)
  • More diverse question sets
  • Real user studies (vs simulated dialogues)
  • Extended temporal experiments
  • Multi-language support

📄 License

Code: MIT License; data and text: CC BY 4.0. See the LICENSE file.


πŸ™ Acknowledgments

This evaluation suite operationalizes theoretical work on:

  • Epistemic virtue theory (Zagzebski, Roberts & Wood)
  • Speech-act theory (Austin, Searle)
  • Phenomenology of time (Husserl, Merleau-Ponty, Sartre)
  • RLHF alignment research (Christiano et al.)

Built with: Python 3.11, httpx, pandas, matplotlib, rich, tenacity



Maintained by

Bentley DeVilling (Course Correct Labs, Boulder Creek, CA) · coursecorrectlabs.com · Bentley@CourseCorrectLabs.com