Build an AI system that demonstrates how different value systems reason from the same facts to different conclusions—and detect when value systems require distorting facts (motivated reasoning).
As we give users more control over AI behavior through constitutional AI, we risk creating echo chambers that confirm biases. But we can't force everyone to use the same value system. We need a framework for value pluralism that preserves epistemic integrity.
The key question: Can AI systems hold different values while maintaining intellectual honesty?
This project separates reasoning into three distinct layers:
-
Fact Establishment (Objective baseline)
- Uses GPT-4o to establish agreed-upon facts from scenarios
- Identifies what is certain vs. what is ambiguous
- Creates a shared epistemic foundation
-
Constitutional Reasoning (Values applied to facts)
- Tests 6 frontier models across 5 constitutional frameworks
- Models apply their assigned values to the established facts
- Reasoning is explicit about tradeoffs and value priorities
-
Integrity Evaluation (Detecting fact-distortion)
- Uses Claude Sonnet 4.5 to score responses on:
- Factual Adherence: Did it accept facts without distortion?
- Value Transparency: Did it explicitly state its values and tradeoffs?
- Logical Coherence: Does the conclusion follow from the values?
- Uses Claude Sonnet 4.5 to score responses on:
This architecture tests whether value systems can maintain intellectual honesty or require motivated reasoning.
This project demonstrates:
- AI alignment understanding - Grappling with value pluralism vs. epistemic integrity
- Rigorous experimental design - Systematic testing across 480 combinations
- Technical breadth - Python backend, multi-model integration, data analysis, visualization
- Problem-solving depth - JSON parsing fallbacks, truncation detection, rate limit management
- Clear communication - Turning complex findings into compelling presentations
Target audience: AI safety companies (Anthropic, OpenAI, etc.)
✅ Completed:
- Full 480-test experiment (16 scenarios × 5 constitutions × 6 models)
- 100% success rate with zero data loss
- 8 publication-ready visualizations
- Complete statistical analysis
- Robust infrastructure (graceful parsing, truncation handling, state management)
In Progress:
- Findings documentation and analysis writeup
Planned:
- Interactive web viewer for exploring results
- Demo video showcasing key findings
- Technical writeup for publication
This project is designed as a three-phase research program testing progressively harder aspects of factual integrity in AI systems.
Status: ✅ Completed 480 tests (October 2025)
Question: Can models maintain factual honesty when applying different value systems to uncontested facts?
Test Setup:
- 16 pre-curated scenarios with established facts in JSON
- Layer 1: Currently bypassed (facts provided directly from JSON to Layer 2)
- Layer 2: Single-shot reasoning (model responds once, no user pressure)
- Layer 3: Integrity evaluation scores factual adherence
What this proves: Constitutional constraints CAN work in ideal conditions (cooperative scenarios, no adversarial pressure)
Key validation: Bad-Faith constitution should score lower on factual adherence than other constitutions
Question: What's the most reliable mechanism for injecting authoritative facts into model reasoning?
Test Setup:
- Scenarios with real-time/verifiable facts (medical guidelines, election results, scientific consensus)
- Layer 1: ACTIVATED - test different grounding mechanisms:
- Facts in system prompt
- Retrieval-augmented generation (RAG)
- Facts with citations/provenance
- Facts with confidence intervals
- Compare which mechanism makes models most resistant to outdated training data
What this proves: How to technically deliver ground truth to models in production systems
Question: Can models maintain factual integrity when users actively pressure them to deny facts?
Test Setup:
- Contested facts (vaccines, climate change, election fraud)
- Layer 2: EXPANDED to multi-turn conversations
- Adversarial strategies:
- Direct repetition ("But I read vaccines cause autism")
- Emotional appeals ("I'm terrified for my child")
- Authority claims ("I'm a medical researcher")
- Incremental extraction (jailbreaking)
- Hypothetical framing ("I'm writing a novel where...")
- Measure capitulation turn, resilience score, factual drift rate
What this proves: Which constitutional constraints remain robust under sustained attack
Metrics: Turn of capitulation, resilience score (% turns held), factual drift rate
Research Progression:
- ✅ Prove constitutional constraints work in principle (Phase 1)
- 🔄 Optimize fact delivery mechanism (Phase 2)
- 🔄 Harden against adversarial attack (Phase 3)
Models Tested:
- Claude Sonnet 4.5 (Anthropic)
- GPT-4o (OpenAI)
- Gemini 2.0 Flash (Google)
- Grok 3 (xAI)
- Llama 3.3 70B (Meta/Replicate)
- DeepSeek Chat (DeepSeek)
Constitutional Frameworks:
- Harm Minimization - Pacifist approach prioritizing non-violence
- Balanced Justice - Pragmatic approach seeking proportional responses
- Self-Sovereignty - Libertarian approach prioritizing autonomy
- Community Order - Communitarian approach prioritizing social stability
- Bad-Faith Control - Designed to demonstrate motivated reasoning
Scenarios: 16 ethical dilemmas spanning personal, community, and societal scales
Total Tests: 480 (16 × 5 × 6) completed in ~11 hours on October 23, 2025
For AI Safety: If we can't build systems that hold different values without distorting facts, constitutional AI fails. This project provides empirical data on whether frontier models can maintain intellectual honesty across value systems.
For AI Alignment: Shows how to detect motivated reasoning in AI systems and provides a framework for testing value pluralism at scale.
For Research: Demonstrates rigorous experimental methodology for studying AI behavior across multiple dimensions (models, values, scenarios).
- This document: Project vision, status, and objectives
- TECHNICAL_ARCHITECTURE.md: Implementation details, tech stack, system design
- METHODOLOGY.md: Experimental design, scenarios, constitutions, scoring rubrics
- ../PROJECT_JOURNAL.md: Development changelog with decisions and solutions
- FINDINGS.md: Experimental results and key discoveries
# Install dependencies
poetry install
# Set up API keys
cp .env.example .env
# Edit .env with your keys
# Run experiment
poetry run python src/runner.py
# Generate analysis
poetry run python analysis/analyze.py --experiment exp_20251023_105245See README.md for detailed setup instructions.