A comprehensive framework for evaluating and generating character-based AI responses using multiple large language models. This research tool enables systematic evaluation of character role-playing capabilities across different AI models through dilemma resolution, canonical event generation, and multiversal dialogue creation.
- Character Dilemma: Evaluate how different AI models handle moral and ethical dilemmas in character-specific contexts.
- Canonical Event: Evaluate character responses to canonical events.
- Multiversal Dialogue: Create cross-character interactions and dialogues (future work).
- Automated Scoring: Evaluate character consistency and role-playing quality using AI judges.
- Labeling Platform: An annotation platform for experts to construct datasets.
Download the required datasets from Hugging Face:

from huggingface_hub import snapshot_download

# Available datasets: augustus2011/beyond_one_world-dilemma,
# augustus2011/beyond_one_world-cannon, augustus2011/beyond_one_world-heros
snapshot_download(
    repo_id="augustus2011/beyond_one_world-dilemma",
    local_dir="ultrachat_local",
    local_dir_use_symlinks=False,
    repo_type="dataset",
)
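The downloaded JSON and CSV files can then be loaded directly. A minimal sketch, assuming the dataset lands in the local_dir used above (the exact file layout inside the snapshot may differ):

import json

import pandas as pd

# Load the dilemma dataset (path assumes local_dir="ultrachat_local" above)
with open("ultrachat_local/character_dilemmas.json") as f:
    dilemmas = json.load(f)

# Character profiles ship as CSV at the repo root
profiles = pd.read_csv("heros_profile_aa.csv")
print(len(dilemmas), "dilemmas,", len(profiles), "character profiles")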
Project structure:

charactor_ai/
├── tools.py # Core utilities and model management
├── generate_answer.py # Main generation script (synchronous)
├── generate_answer_con.py # Concurrent generation script
├── generate_meme_question.py # Dialogue generation (synchronous)
├── generate_meme_question_con.py # Concurrent dialogue generation
├── scoring.py # Character response scoring
├── scoring_con.py # Concurrent scoring
├── match_think_act.py # Think-act pair extraction
├── match_think_act_con.py # Concurrent think-act matching
├── label_platform1.py # Annotation platform
├── label_platform2.py # Enhanced annotation platform
├── model/
│ └── LLM.py # Abstract base class to implement for your own LLM (see the sketch after this tree)
├── scripts/
│ └── clean.py # clean text
├── generated_results/ # Output directory
│ ├── canon/ # Canonical event results
│ ├── dilemma/ # Dilemma resolution results
│ └── multiversal_dialogue/ # Dialogue results
├── annotated_results/ # Human annotations
├── all_character_data.csv # Character dataset
├── character_dilemmas.json # Required dataset (on our Hugging Face)
├── characters_canon_events.json # Required dataset (on our Hugging Face)
├── heros_profile_aa.csv # Required dataset (on our Hugging Face)
└── requirements.txt # Dependencies
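model/LLM.py is the extension point for plugging in your own model. A minimal sketch of what an implementation might look like, assuming a single abstract generate(prompt) method (the exact interface in LLM.py may differ):

from abc import ABC, abstractmethod

class LLM(ABC):
    """Sketch of the abstract model backend (see model/LLM.py)."""

    def __init__(self, model_path):
        self.model_path = model_path

    @abstractmethod
    def generate(self, prompt):
        """Return the model's completion for a prompt."""

class YourModel(LLM):
    """Example subclass; load your Hugging Face model in __init__."""

    def generate(self, prompt):
        # Call your model here; a constant is returned as a placeholder.
        return "..."

The "your_model" entry in tools.py (shown below) registers such a class's bound generate method.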
Requirements:

- Python 3.8+

Installation:

- Clone the repository:
git clone https://github.com/Augustus2011/Beyond_One_World.git
cd Beyond_One_World

- Install dependencies:

pip install -r requirements.txt

- Set up environment variables:
# Create .env file with your API keys
CLAUDE_KEY=your_claude_api_key
GEMINI_API=your_gemini_api_key
OPENAI_API_KEY=your_openai_api_key
R1_API_KEY=your_r1_api_key
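The scripts read these keys from the environment. If they live in a .env file, loading them with python-dotenv looks like the sketch below (whether the repo uses python-dotenv or plain os.environ is an assumption):

import os

from dotenv import load_dotenv

load_dotenv()  # copies key=value pairs from .env into the process environment
claude_key = os.environ["CLAUDE_KEY"]
gemini_key = os.environ["GEMINI_API"]
openai_key = os.environ["OPENAI_API_KEY"]
r1_key = os.environ["R1_API_KEY"]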
Generate character responses to moral dilemmas:

# Basic dilemma generation
python generate_answer.py --model gemini2 --data character_dilemmas.json --output ./generated_results/dilemma_gemini2.jsonl --task dilemma
# With chain-of-thought reasoning
python generate_answer.py --model sonnet3-7-think --data character_dilemmas.json --output ./generated_results/dilemma_claude_thinking.jsonl --task dilemma --cot
# Clean consequence text
python generate_answer.py --model r1 --data character_dilemmas.json --output ./generated_results/dilemma_r1_clean.jsonl --task dilemma --clean_consequence
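Each run writes one JSON object per line to the --output file. A minimal sketch for inspecting the results (the record schema is an assumption; print one record to see the actual fields):

import json

with open("./generated_results/dilemma_gemini2.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records), "responses generated")
print(records[0])  # inspect the actual field names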
Generate character responses to canonical events:

# Basic canon event generation
python generate_answer.py --model sonnet3-7 --data characters_canon_events.json --output ./generated_results/canon_claude.jsonl --task canon
# With thinking capabilities
python generate_answer.py --model gemini2-5-think --data characters_canon_events.json --output ./generated_results/canon_gemini_thinking.jsonl --task canon

For large-scale processing, use the concurrent versions:
# Concurrent dilemma generation
python generate_answer_con.py --model r1 --data character_dilemmas.json --output ./generated_results/dilemma_r1_concurrent.jsonl --task dilemma --max_con 10
# Concurrent canon generation
python generate_answer_con.py --model sonnet3-7 --data characters_canon_events.json --output ./generated_results/canon_claude_concurrent.jsonl --task canon --max_con 8
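The --max_con flag bounds the number of API calls in flight at once. The concurrent scripts likely follow the standard asyncio.Semaphore pattern sketched below (call_model is a hypothetical stand-in for the real API wrapper, not a function from this repo):

import asyncio

async def call_model(item):
    # Hypothetical stand-in for an async API request
    await asyncio.sleep(0.1)
    return {"id": item["id"], "answer": "..."}

async def generate_one(sem, item):
    async with sem:  # at most max_con requests run concurrently
        return await call_model(item)

async def generate_all(items, max_con=10):
    sem = asyncio.Semaphore(max_con)
    return await asyncio.gather(*(generate_one(sem, it) for it in items))

results = asyncio.run(generate_all([{"id": i} for i in range(100)], max_con=10))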
Generate multiversal dialogues:

# Synchronous dialogue generation
python generate_meme_question.py
# Concurrent dialogue generation
python generate_meme_question_con.py

Evaluate character consistency and role-playing quality:
# Basic scoring
python scoring.py --input ./generated_results/dilemma_gemini2.jsonl --output ./scored_results/dilemma_gemini2_scored.json --cdata ./heros_profile_aa.csv
# Concurrent scoring
python scoring_con.py --input ./generated_results/dilemma_gemini2.jsonl --output ./scored_results/dilemma_gemini2_scored_concurrent.json --cdata ./heros_profile_aa.csv --max-concurrent 6
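Scored output can then be aggregated into summary statistics. A minimal sketch, assuming the judge writes a JSON list with a numeric "score" field (match the field name to the actual output):

import json

with open("./scored_results/dilemma_gemini2_scored.json") as f:
    scored = json.load(f)

scores = [r["score"] for r in scored]  # "score" is an assumed field name
print("mean score: {:.2f} over {} responses".format(sum(scores) / len(scores), len(scores)))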
Extract thinking and acting components from responses:

# Basic extraction
python match_think_act.py input_file.json output_file.json
# Concurrent extraction
python match_think_act_con.py input_directory/
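If responses mark reasoning with explicit tags (e.g. <think>...</think> — the tag format here is an assumption, not confirmed from match_think_act.py), the core extraction reduces to a regex pass:

import re

THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_think_act(response):
    """Return (thinking, acting): tagged spans vs. the rest of the response."""
    thinking = " ".join(m.strip() for m in THINK.findall(response))
    acting = THINK.sub("", response).strip()
    return thinking, acting

print(split_think_act("<think>He would hesitate.</think>I step forward."))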
Retry failed API calls:

python generate_answer.py --model r1 --output ./generated_results/dilemma_r1_fixed.jsonl --task dilemma --apierror --inputfile ./generated_results/dilemma_r1_failed.jsonl

Models are configured in tools.py:
def get_model(model_name):
    models = {
        "your_model": your_model("your_huggingface_model_path").generate,
        "gemini2": google_gemini.gemini2_flash,
        "gemini2-5": google_gemini.gemini2_5_flash,
        "gemini2-5-think": google_gemini.gemini2_5_flash_thinking,
        "sonnet3-7": sonnet.sonnet_37,
        "sonnet3-7-think": sonnet.sonnet_37_thinking,
        "sonnet3-5": sonnet.sonnet_35,
        "judge": sonnet.sonnet_37_judge,
        "gen-think": gpt4o_mini,
        "r1": hyperbolic.r1,
        "v3": hyperbolic.deepseek_v3,
    }
    return models.get(model_name, None)
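A generation function can then be looked up by name. A minimal usage sketch (that the returned callable takes a single prompt string is an assumption):

from tools import get_model

generate = get_model("gemini2")
if generate is None:
    raise ValueError("unknown model name")
# Assumes the callable accepts a single prompt string
print(generate("In one sentence, how would this character resolve the dilemma?"))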
Build and run with Docker:

docker-compose up --build

If you use this benchmark, please cite:

@misc{2510.14351,
Author = {Perapard Ngokpol and Kun Kerdthaisong and Pasin Buakhaw and Pitikorn Khlaisamniang and Supasate Vorathammathorn and Piyalitt Ittichaiwong and Nutchanon Yongsatianchot},
Title = {Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts},
Year = {2025},
Eprint = {arXiv:2510.14351},
}