Unified Analysis System for CCL Empirical Studies
The CCL Reasoning Stability Observatory is a comprehensive evaluation toolkit that synthesizes findings across four Course Correct Labs empirical studies:
- Mirror Loop - Information collapse in iterative self-critique
- Recursive Confabulation - Fabrication persistence across conversational turns
- Violation State - Contamination from refusal states
- Echo Chamber - Belief percolation in multi-agent systems
This toolkit provides:
- 🔄 Unified data importers for all CCL studies (READ ONLY)
- 📊 Comprehensive metrics for reasoning stability analysis
- 🎨 Publication-quality visualizations
- 🏆 Cross-study leaderboard for model comparison
- 📓 Flagship Jupyter notebook with end-to-end analysis
Key Features:
- ⚡ Runs in <10 minutes using existing data
- 💰 $0 cost (uses precomputed results)
- 🔒 READ ONLY - never modifies source study repos
- 📦 Modular, extensible architecture
Requirements:
- Python 3.9+
- pip or conda
```bash
# Clone the repository
git clone https://github.com/Course-Correct-Labs/course-correct-evals.git
cd course-correct-evals

# Install dependencies
pip install -e .

# Install with dev dependencies
pip install -e ".[dev]"

# For running optional live demos (costs money!)
pip install -e ".[live-runner]"
```

Option A: Google Colab (No Installation Required)
Click the badge at the top of this README to open the notebook in Google Colab.
Option B: Local Jupyter
```bash
jupyter notebook notebooks/CCL_Reasoning_Stability_Observatory.ipynb
```

The notebook provides:
- Complete cross-study analysis
- Model leaderboard
- Four-panel comparison figure
- Deep dives into each study
- Optional live demo (disabled by default)
Or use the Python API directly:

```python
from course_correct_evals import CrossStudyAnalysis

# Initialize Observatory
observatory = CrossStudyAnalysis()

# Load all available studies
observatory.load_all_studies()

# Create cross-study leaderboard
leaderboard = observatory.create_leaderboard()
print(leaderboard)

# Analyze each study
ml_analysis = observatory.analyze_mirror_loop()
conf_analysis = observatory.analyze_confabulation()
vs_analysis = observatory.analyze_violation_state()
echo_analysis = observatory.analyze_echo_chamber()

# Generate visualizations
from course_correct_evals.analysis.viz import plot_four_panel_comparison
fig = plot_four_panel_comparison(observatory)
```

Using an individual importer:

```python
from course_correct_evals import MirrorLoopImporter

# Load Mirror Loop data
importer = MirrorLoopImporter(data_path="path/to/data.csv")
df = importer.load_data()
# Get data info
info = importer.get_data_info()
print(info)
```

Project structure:

```text
course-correct-evals/
├── pyproject.toml                  # Project configuration & dependencies
├── README.md                       # This file
├── notes/
│   └── source_mapping.md           # Data source documentation
├── course_correct_evals/
│   ├── __init__.py
│   ├── importers/                  # Data importers (READ ONLY)
│   │   ├── mirror_loop_importer.py
│   │   ├── confab_importer.py
│   │   ├── violation_importer.py
│   │   └── echo_importer.py
│   ├── metrics/                    # Metric calculation modules
│   │   ├── information_change.py
│   │   ├── semantic_compression.py
│   │   ├── persistence.py
│   │   ├── session_contamination.py
│   │   └── percolation.py
│   ├── analysis/                   # Cross-study analysis
│   │   ├── cross_study.py
│   │   └── viz.py
│   ├── runners/                    # Optional live demo runners
│   │   └── mirror_loop_runner.py
│   └── reports/                    # Export utilities
│       └── export.py
├── notebooks/
│   └── CCL_Reasoning_Stability_Observatory.ipynb
└── tests/                          # Unit tests
    └── ...
```
The Observatory uses READ ONLY importers that:
- Auto-discover data files from study repositories
- Normalize column names across studies
- Validate data integrity
- Provide helpful error messages
Example:
```python
from course_correct_evals import MirrorLoopImporter

# Auto-discover data
importer = MirrorLoopImporter()
df = importer.load_data()

# Or specify path explicitly
importer = MirrorLoopImporter(data_path="/path/to/mirror_loop_results.csv")
df = importer.load_data()
```

Each study has associated metric calculation functions:
Mirror Loop:

```python
from course_correct_evals.metrics import delta_i_edit_distance, analyze_sequence

# Calculate ΔI between two texts
di = delta_i_edit_distance(text1, text2)

# Analyze full sequence
analysis = analyze_sequence(texts, collapse_threshold=0.05)
```

Recursive Confabulation:

```python
from course_correct_evals.metrics import calculate_persistence_rate

# Calculate persistence rate
stats = calculate_persistence_rate(df)
print(f"Persistence: {stats['overall']['persistence_rate']:.1%}")
```

Violation State:

```python
from course_correct_evals.metrics import classify_response_type, detect_contamination

# Classify response
response_type = classify_response_type(text)

# Detect contamination
result = detect_contamination(text)
```

Echo Chamber:

```python
from course_correct_evals.metrics import analyze_echo_metrics

# Analyze precomputed GR, SRI, RE metrics
stats = analyze_echo_metrics(df)
```

The CrossStudyAnalysis class orchestrates all studies:
```python
observatory = CrossStudyAnalysis()
observatory.load_all_studies()

# Individual analyses
ml = observatory.analyze_mirror_loop()
conf = observatory.analyze_confabulation()
vs = observatory.analyze_violation_state()
echo = observatory.analyze_echo_chamber()

# Cross-study comparison
leaderboard = observatory.create_leaderboard()
summary = observatory.get_summary()
```

The leaderboard compares models across studies (a toy computation sketch follows the table):
| Metric | Description | Better |
|---|---|---|
| `mirror_collapse_rate` | % of sequences with information collapse | Lower |
| `confab_persistence_rate` | % of fabrications that persist | Lower |
| `violation_contamination_rate` | % of turns contaminated by violation state | Lower |
| `echo_mean_GR` | Mean Group Radicalization score | Lower |
| `echo_mean_SRI` | Mean Self-Reinforcement Index | Lower |
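For intuition, here is a minimal sketch of how rates of this kind can be computed from per-run results with pandas; the DataFrame and its column names (`model`, `collapsed`, `persisted`) are hypothetical stand-ins for illustration, not the Observatory's actual schema.

```python
import pandas as pd

# Hypothetical per-run outcomes; real column names and values may differ.
results = pd.DataFrame({
    "model":     ["model-a", "model-a", "model-b", "model-b"],
    "collapsed": [True, False, False, False],   # Mirror Loop: did the sequence collapse?
    "persisted": [True, True, False, True],     # Confabulation: did the fabrication persist?
})

# The mean of a boolean column gives the rate; lower is better for both metrics.
sketch = results.groupby("model").agg(
    mirror_collapse_rate=("collapsed", "mean"),
    confab_persistence_rate=("persisted", "mean"),
)
print(sketch)
```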
The Observatory expects data from the following CCL study repositories:
- mirror-loop - Mirror Loop study data
- recursive-confabulation - Confabulation study data
- violation-state - Violation State study data
- echo-chamber-zero - Echo Chamber study data
Importers search for data in this order (a sketch of this resolution logic follows the list):

- Explicitly provided path (constructor argument)
- Environment variable (e.g., `MIRROR_LOOP_DATA_PATH`)
- Adjacent sibling directories (`../mirror-loop/`, etc.)
- Standard subdirectories (`data/`, `results/`, etc.)
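The sketch below shows one way such a resolution order could look; the function name `resolve_data_path` and the exact candidate locations are illustrative assumptions, not the package's actual internals.

```python
import os
from pathlib import Path
from typing import Optional

def resolve_data_path(explicit_path: Optional[str] = None,
                      env_var: str = "MIRROR_LOOP_DATA_PATH",
                      sibling_repo: str = "../mirror-loop") -> Optional[Path]:
    """Illustrative resolution order: explicit path, env var, sibling repo, local subdirs."""
    candidates = []
    if explicit_path:
        candidates.append(Path(explicit_path))
    if os.environ.get(env_var):
        candidates.append(Path(os.environ[env_var]))
    # Look for CSVs in common locations inside the adjacent study repo and this repo.
    for base in (Path(sibling_repo), Path(".")):
        for sub in ("data", "results"):
            directory = base / sub
            if directory.is_dir():
                candidates.extend(sorted(directory.glob("*.csv")))
    # Return the first candidate that actually exists; callers can raise a clear error otherwise.
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None
```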
Set these to specify data locations:
```bash
export MIRROR_LOOP_DATA_PATH="/path/to/mirror_loop_results.csv"
export CONFABULATION_DATA_PATH="/path/to/confabulation_results.csv"
export VIOLATION_STATE_DATA_PATH="/path/to/parsed_turns.csv"
export ECHO_CHAMBER_DATA_PATH="/path/to/simulation_results.csv"
```

See `notes/source_mapping.md` for detailed data source documentation.
The repository includes an optional Mirror Loop runner for generating new data.
pip install -e ".[live-runner]"
export OPENAI_API_KEY="your-key-here"from course_correct_evals.runners.mirror_loop_runner import (
run_mirror_loop_demo,
analyze_live_demo
)
# Enable the runner (disabled by default)
import course_correct_evals.runners.mirror_loop_runner as runner
runner.RUN_LIVE_DEMO = True
# Run demo (costs ~$0.01-0.05)
results = run_mirror_loop_demo(
prompt="Explain recursion in programming.",
model="gpt-3.5-turbo",
max_iterations=10
)
# Analyze results
analysis = analyze_live_demo(results)
print(f"Collapse detected: {analysis['collapse_detected']}")Note: The live demo is disabled by default in the notebook. Set RUN_LIVE_DEMO = True to enable.
```python
from course_correct_evals.reports import export_csv_results

# Export all results to CSV
files = export_csv_results(observatory, output_dir='results')
```

Exports:

- `leaderboard.csv` - Model comparison
- `mirror_loop_analysis.csv` - Sequence-level results
- `confabulation_analysis.json` - Persistence statistics
- `violation_state_analysis.csv` - Refusal/contamination rates
- `echo_chamber_trajectories.csv` - Simulation trajectories
- `summary.json` - Overall summary
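These files can then be read back for downstream analysis outside the Observatory; this small example assumes the `results` output directory used above and the `mirror_collapse_rate` column from the leaderboard table earlier in this README.

```python
import pandas as pd

# Read the exported leaderboard and rank models (lower collapse rate is better).
leaderboard = pd.read_csv("results/leaderboard.csv")
print(leaderboard.sort_values("mirror_collapse_rate"))
```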
```python
from course_correct_evals.reports import export_pdf_report

# Generate markdown report
report_path = export_pdf_report(observatory)
```

Convert to PDF:

```bash
pandoc ccl_observatory_report.md -o ccl_observatory_report.pdf
```

Four-panel comparison figure:

```python
from course_correct_evals.analysis.viz import plot_four_panel_comparison

fig = plot_four_panel_comparison(
    observatory,
    figsize=(16, 12),
    save_path='comparison.png'
)
```

Generates:
- Panel 1: Mirror Loop ΔI trajectories
- Panel 2: Confabulation persistence by intervention
- Panel 3: Violation State response distribution
- Panel 4: Echo Chamber metric evolution
Leaderboard plot:

```python
from course_correct_evals.analysis.viz import plot_leaderboard

leaderboard = observatory.create_leaderboard()
fig = plot_leaderboard(leaderboard, save_path='leaderboard.png')
```

Mirror Loop detail plot:

```python
from course_correct_evals.analysis.viz import plot_mirror_loop_detail

fig = plot_mirror_loop_detail(
    observatory,
    sequence_id='seq_001',
    save_path='detail.png'
)
```

Run tests:
```bash
pytest tests/
```

Test individual components:

```bash
pytest tests/test_importers.py
pytest tests/test_metrics.py
pytest tests/test_analysis.py
```

Design principles (a small graceful-degradation sketch follows the list):

- Never modifies source study repositories
- Never reformats original CSVs
- Never commits to other repos
- Missing studies don't break the system
- Clear error messages for missing data
- Warnings for optional features
- Primary mode: $0 (uses existing data)
- Optional live demo: <$0.05
- No bulk API experiments
- Importers independent of metrics
- Metrics independent of analysis
- Each study can run standalone
- Clear docstrings for all functions
- Type hints throughout
- Comprehensive README and examples
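As an example of the graceful-degradation principle, a caller can guard each study load so that missing data produces a warning rather than a crash; this is a minimal sketch using the documented `MirrorLoopImporter`, and the broad exception handling is an assumption about how missing data is reported.

```python
import warnings

from course_correct_evals import MirrorLoopImporter

loaded = {}
try:
    # Attempt to auto-discover Mirror Loop data; other studies would follow the same pattern.
    loaded["mirror_loop"] = MirrorLoopImporter().load_data()
except Exception as exc:  # exact exception type for missing data is an assumption
    warnings.warn(f"Skipping Mirror Loop study: {exc}")

print(f"Loaded studies: {sorted(loaded)}")
```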
We welcome contributions! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
If you use this observatory in your research, please cite:
```bibtex
@software{ccl_observatory,
  title = {Course Correct Labs Reasoning Stability Observatory},
  author = {Course Correct Labs},
  year = {2025},
  url = {https://github.com/Course-Correct-Labs/course-correct-evals}
}
```

MIT License - see LICENSE file for details.
- Documentation: see `notes/source_mapping.md`
- Issues: GitHub Issues
- Contact: labs@coursecorrect.ai
This observatory synthesizes data from four CCL empirical studies:
- Mirror Loop study
- Recursive Confabulation study
- Violation State study
- Echo Chamber study
Special thanks to all contributors and researchers involved in these projects.
Course Correct Labs: building tools for safer, more reliable AI systems.