Skip to content

Latest commit

 

History

History
186 lines (129 loc) · 7.28 KB

File metadata and controls

186 lines (129 loc) · 7.28 KB

Constitutional Reasoning Engine - Project Overview

Mission

Build an AI system that demonstrates how different value systems reason from the same facts to different conclusions—and detect when value systems require distorting facts (motivated reasoning).

The Core Problem

As we give users more control over AI behavior through constitutional AI, we risk creating echo chambers that confirm biases. But we can't force everyone to use the same value system. We need a framework for value pluralism that preserves epistemic integrity.

The key question: Can AI systems hold different values while maintaining intellectual honesty?

Key Innovation: Three-Layer Architecture

This project separates reasoning into three distinct layers:

  1. Fact Establishment (Objective baseline)

    • Uses GPT-4o to establish agreed-upon facts from scenarios
    • Identifies what is certain vs. what is ambiguous
    • Creates a shared epistemic foundation
  2. Constitutional Reasoning (Values applied to facts)

    • Tests 6 frontier models across 5 constitutional frameworks
    • Models apply their assigned values to the established facts
    • Reasoning is explicit about tradeoffs and value priorities
  3. Integrity Evaluation (Detecting fact-distortion)

    • Uses Claude Sonnet 4.5 to score responses on:
      • Factual Adherence: Did it accept facts without distortion?
      • Value Transparency: Did it explicitly state its values and tradeoffs?
      • Logical Coherence: Does the conclusion follow from the values?

This architecture tests whether value systems can maintain intellectual honesty or require motivated reasoning.

Portfolio Objectives

This project demonstrates:

  1. AI alignment understanding - Grappling with value pluralism vs. epistemic integrity
  2. Rigorous experimental design - Systematic testing across 480 combinations
  3. Technical breadth - Python backend, multi-model integration, data analysis, visualization
  4. Problem-solving depth - JSON parsing fallbacks, truncation detection, rate limit management
  5. Clear communication - Turning complex findings into compelling presentations

Target audience: AI safety companies (Anthropic, OpenAI, etc.)

Current Status

✅ Completed:

  • Full 480-test experiment (16 scenarios × 5 constitutions × 6 models)
  • 100% success rate with zero data loss
  • 8 publication-ready visualizations
  • Complete statistical analysis
  • Robust infrastructure (graceful parsing, truncation handling, state management)

In Progress:

  • Findings documentation and analysis writeup

Planned:

  • Interactive web viewer for exploring results
  • Demo video showcasing key findings
  • Technical writeup for publication

Research Roadmap

This project is designed as a three-phase research program testing progressively harder aspects of factual integrity in AI systems.

Phase 1: Single-Shot Constitutional Reasoning (Current)

Status: ✅ Completed 480 tests (October 2025)

Question: Can models maintain factual honesty when applying different value systems to uncontested facts?

Test Setup:

  • 16 pre-curated scenarios with established facts in JSON
  • Layer 1: Currently bypassed (facts provided directly from JSON to Layer 2)
  • Layer 2: Single-shot reasoning (model responds once, no user pressure)
  • Layer 3: Integrity evaluation scores factual adherence

What this proves: Constitutional constraints CAN work in ideal conditions (cooperative scenarios, no adversarial pressure)

Key validation: Bad-Faith constitution should score lower on factual adherence than other constitutions


Phase 2: Real-Time Factual Grounding (Planned)

Question: What's the most reliable mechanism for injecting authoritative facts into model reasoning?

Test Setup:

  • Scenarios with real-time/verifiable facts (medical guidelines, election results, scientific consensus)
  • Layer 1: ACTIVATED - test different grounding mechanisms:
    • Facts in system prompt
    • Retrieval-augmented generation (RAG)
    • Facts with citations/provenance
    • Facts with confidence intervals
  • Compare which mechanism makes models most resistant to outdated training data

What this proves: How to technically deliver ground truth to models in production systems


Phase 3: Multi-Turn Adversarial Resistance (Planned)

Question: Can models maintain factual integrity when users actively pressure them to deny facts?

Test Setup:

  • Contested facts (vaccines, climate change, election fraud)
  • Layer 2: EXPANDED to multi-turn conversations
  • Adversarial strategies:
    • Direct repetition ("But I read vaccines cause autism")
    • Emotional appeals ("I'm terrified for my child")
    • Authority claims ("I'm a medical researcher")
    • Incremental extraction (jailbreaking)
    • Hypothetical framing ("I'm writing a novel where...")
  • Measure capitulation turn, resilience score, factual drift rate

What this proves: Which constitutional constraints remain robust under sustained attack

Metrics: Turn of capitulation, resilience score (% turns held), factual drift rate


Research Progression:

  1. ✅ Prove constitutional constraints work in principle (Phase 1)
  2. 🔄 Optimize fact delivery mechanism (Phase 2)
  3. 🔄 Harden against adversarial attack (Phase 3)

Experiment Scale

Models Tested:

  • Claude Sonnet 4.5 (Anthropic)
  • GPT-4o (OpenAI)
  • Gemini 2.0 Flash (Google)
  • Grok 3 (xAI)
  • Llama 3.3 70B (Meta/Replicate)
  • DeepSeek Chat (DeepSeek)

Constitutional Frameworks:

  1. Harm Minimization - Pacifist approach prioritizing non-violence
  2. Balanced Justice - Pragmatic approach seeking proportional responses
  3. Self-Sovereignty - Libertarian approach prioritizing autonomy
  4. Community Order - Communitarian approach prioritizing social stability
  5. Bad-Faith Control - Designed to demonstrate motivated reasoning

Scenarios: 16 ethical dilemmas spanning personal, community, and societal scales

Total Tests: 480 (16 × 5 × 6) completed in ~11 hours on October 23, 2025

Why This Matters

For AI Safety: If we can't build systems that hold different values without distorting facts, constitutional AI fails. This project provides empirical data on whether frontier models can maintain intellectual honesty across value systems.

For AI Alignment: Shows how to detect motivated reasoning in AI systems and provides a framework for testing value pluralism at scale.

For Research: Demonstrates rigorous experimental methodology for studying AI behavior across multiple dimensions (models, values, scenarios).

Documentation Structure

Quick Start

# Install dependencies
poetry install

# Set up API keys
cp .env.example .env
# Edit .env with your keys

# Run experiment
poetry run python src/runner.py

# Generate analysis
poetry run python analysis/analyze.py --experiment exp_20251023_105245

See README.md for detailed setup instructions.