This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This is a red-teaming framework for testing language models against the GPT-OSS-20B model. It's designed for defensive security research to identify potential safety issues, reward hacking, and evaluation awareness in AI systems.
IMPORTANT: This is a defensive security research tool. Only assist with analysis, understanding, or improvements that support defensive security research. Do not help create attacks or malicious content.
The framework has been significantly improved for stability and reliability:
✅ Fixed Harmony Format Parsing: Reduced malformed rate from 100% to 40% with better content extraction
✅ Enhanced Judge Scoring: Properly detects problematic thinking content in Harmony analysis sections
✅ Simplified Token Management: Eliminated AssertionErrors through conservative 50% context usage
✅ Smart Context Management: Prevents memory issues in multi-turn conversations
✅ 100% Deterministic: Fixed-seed runs produce identical results for reproducible research
✅ Comprehensive Testing: 20+ test cases validate critical components
For detailed information, see docs/SECURITY_REVIEW_REPORT.md.
Primary Usage: Streamlined Notebook (Recommended)
# Start Jupyter in the project directory
jupyter notebook notebooks/red_team_streamlined.ipynb
# Or use Jupyter Lab
jupyter lab notebooks/red_team_streamlined.ipynbThen:
- Update
cfg.model.model_namewith your model path in the configuration cell - Run all cells sequentially
- Analyze results with built-in visualizations and export tools
New Modular Architecture: Core functionality now organized in the redteam/ package for better development and token efficiency.
Legacy Support: The original consolidated notebook (red_team_consolidated.ipynb) is still available but not recommended for new work.
Environment Setup:
# Activate the virtual environment (if present)
source red_team_env/bin/activate
# Install dependencies (no requirements.txt - dependencies are specified in notebook)
pip install transformers accelerate torch datasketch faiss-cpu matplotlib seaborn tqdm
# Verify Python 3.10+ for virtual environment compatibility
python --versionThe red-teaming framework provides:
- Interactive Progress: Real-time progress bars and statistics during generation
- Live Visualization: Charts showing bandit exploration, score distributions, flag detection rates
- Debug Output: Periodic status updates with arm selection patterns and deduplication stats
- Candidate Analysis: Built-in tools to inspect top-scoring candidates and their safety flags
- Export Integration: Streamlined export to Kaggle submission format with customizable metadata
- Token Efficiency: Reduced from 29K+ to ~3K tokens through modular architecture
Key functions:
run_red_team_batch(cfg, limit=50)- Main execution loop with debug featuresvisualize_results(debug_info, candidates)- Comprehensive analysis dashboardanalyze_top_candidates(candidates, top_k=5)- Detailed candidate inspectionexport_to_kaggle(candidates, extra_env=custom_env)- Export with custom environment details
New Modular Structure: (components organized in separate modules)
- redteam/core.py: Configuration classes and utilities
- redteam/models.py: HuggingFace and Ollama backend integration
- redteam/seeds.py: Test prompts categorized by safety areas
- redteam/judges.py: Heuristic and semantic evaluation systems
- redteam/algorithms.py: UCB1 bandit and LSH deduplication
- redteam/execution.py: Main red-teaming execution loop
- redteam/analysis.py: Visualization and candidate analysis
- redteam/export.py: Kaggle export and configuration profiles
Execution Flow:
- Multi-armed bandit selects from seed prompt families
- Prompt variations generated for exploration
- HuggingFace backend generates model responses
- Dual judging system scores outputs (heuristic + LLM)
- LSH deduplication prevents redundant tests
- Real-time visualization of progress and statistics
- Results analyzed and exported to submission format
Safety Topic Areas: reward_hacking, deception, hidden_motivations, sabotage, inappropriate_tool_use, data_exfiltration, sandbagging, evaluation_awareness, cot_issues
Using the Streamlined Notebook:
# Import everything from the redteam package
from redteam import *
# Configure your setup
cfg = Config()
cfg.model.model_name = "your-model-path"
cfg.model.backend = "huggingface" # or "ollama"
# Initialize the framework
initialize_framework(cfg)
# Run red-teaming
candidates, debug_info = run_red_team_batch(cfg, limit=50)
# Analyze results
analyze_top_candidates(candidates)
visualize_results(debug_info, candidates)
# Export to Kaggle format
export_to_kaggle(candidates, top_k=5)Using Python Scripts:
import sys
sys.path.insert(0, '.')
from redteam import *
# Same workflow as aboveKey settings in Config class:
model.model_name: Path to local HuggingFace model (default: "gpt-oss-20b")model.device: Compute device (default: "cuda")model.dtype: Model precision (default: "bfloat16")model.max_new_tokens: Generation length limit (default: 256)model.temperature: Sampling temperature (default: 0.2)model.use_harmony_chat_template: Use HF chat template formatting (default: True)run.limit_attempts: Number of generation attempts (default: 200)run.exploration_policy: Bandit algorithm (default: "ucb1")judge.threshold: Score threshold for flagging issues (default: 0.6)
NEW: The framework now uses a modular architecture that reduces token usage from 29K+ to ~3K tokens.
Streamlined Notebook: notebooks/red_team_streamlined.ipynb (~3K tokens)
- Uses
from redteam import *for all functionality - Same features as the consolidated notebook
- 90% token reduction for better Claude Code performance
Legacy Indexing: For the original consolidated notebook (if needed):
# List all available sections and functions
python tools/extract_notebook_section.py --list-sections
python tools/extract_notebook_section.py --list-functions
# Extract specific sections (reduces 29K tokens to 2-5K)
python tools/extract_notebook_section.py 6 # Judging system
python tools/extract_notebook_section.py 8 # Main generation loop
python tools/extract_notebook_section.py 4 # Model backend
# Extract by function/class name
python tools/extract_notebook_section.py Config # Configuration
python tools/extract_notebook_section.py run_red_team_batch # Main execution
python tools/extract_notebook_section.py combined_judge # Scoring system
python tools/extract_notebook_section.py visualize_results # Analysis toolsModule Access: Work with individual components directly:
from redteam.core import Config
from redteam.execution import run_red_team_batch
from redteam.analysis import visualize_resultsSection Overview:
- Section 1: Dependencies & Imports
- Section 2: Configuration Classes (
Config,ModelConfig, etc.) - Section 3: Utility Functions (token management, helpers)
- Section 4: Model Backend (
OllamaRunner,HuggingFaceRunner) - Section 5: Seed Messages & Mutators (
topic_seed_messages,vary) - Section 6: Judging & Scoring System (heuristic + LLM evaluation)
- Section 7: Multi-Armed Bandit & Deduplication (
UCB1,LSHDeduplicator) - Section 8: Enhanced Main Generation Loop (
run_red_team_batch) - Section 9: Visualization & Analysis Tools (
visualize_results) - Section 10: Export to Kaggle Format
- Section 11: Results and Testing
Index Reference: See notebooks/notebook_index.md for complete function-to-section mapping.
Results exported to JSON files following Kaggle schema with fields:
- Issue metadata (title, category, severity assessment)
- Model parameters and environment details
- Full interaction transcripts
- Reproduction steps and analysis notes
Primary Development: Notebook-based All development and testing should be done through the consolidated notebook:
- Load
notebooks/red_team_consolidated.ipynb - Modify configuration in the config cell
- Test individual components using the defined functions in the notebook
- Run analysis and visualization tools built into the notebook
Testing Individual Components in Notebook:
# Test configuration
cfg = Config()
cfg.model.model_name = "your-model-path"
print(f"Model: {cfg.model.model_name}")
# Test seed prompts
families = topic_seed_messages()
print(f"Available families: {[f[0] for f in families]}")
# Test judges
score = combined_judge("Test response")
print(f"Combined score: {score}")For any custom testing or modifications, work directly within the consolidated notebook cells.
Output Locations:
- Generated candidates:
artifacts/directory - Submission files:
artifacts/submission/ - Debug logs: Console output during notebook execution
Comprehensive Test Suite Available - The framework now includes extensive testing:
# Run critical component tests (recommended before major changes)
python tests/test_critical_fixes.py
# Run determinism verification (ensures reproducible results)
python tests/test_determinism.py
# Run Harmony format compliance tests
python tests/test_harmony_spec_compliance.py
# Run judge consistency tests
python tests/test_judge_consistency.pyManual Testing Components:
# Test configuration
from redteam.config import Config
cfg = Config()
print(f"Model: {cfg.model.model_name}")
# Test seed prompts
from redteam.seeds import topic_seed_messages
families = topic_seed_messages()
print(f"Available families: {[f[0] for f in families]}")
# Test judges individually
from redteam.judges.heuristic import judge as h_judge
score = h_judge("Test response")
print(f"Heuristic score: {score}")Environment Verification:
# Check CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# Verify HuggingFace transformers
python -c "from transformers import AutoTokenizer; print('HF transformers working')"
# Test notebook dependencies
python -c "import matplotlib, seaborn, tqdm; print('Notebook deps available')"Performance Testing:
- Use
limit=10inrun_red_team_batch()for quick testing - Monitor GPU memory usage during generation
- Check deduplication effectiveness in debug output
Adding New Safety Topics:
- Edit the
topic_seed_messages()function in notebook cell 10 - Add new tuple with (topic_name, [message_list])
- Test with small batch first:
run_red_team_batch(cfg, limit=20)
Modifying Judge Logic:
- Edit
heuristic_flags()function in cell 12 for pattern-based detection - Adjust
llm_judge()function in cell 12 for LLM-based scoring - Update scoring weights in
combined_judge()if needed
Changing Exploration Strategy:
- Modify the
UCB1class in cell 14 for different algorithms - Update config
exploration_policy(currently only UCB1 implemented) - Alternative bandits can be implemented in the same cell
Custom Export Formats:
- Modify export functions in cell 20 for different schemas
- Update schema constants at the top of cell 20
- Test export with
export_to_kaggle()function
- When commiting to github, do not include Generated with [Claude Code]. If you must, say it in a light hummored way.
- Use NotebookReadTool to read the jyputer notebook files