This guide will take you from zero to hero with the ShinkaEvolve framework, starting with simple Azure OpenAI integration tests, progressing through examples, and ending with creating your own evolution experiments.
- Prerequisites
- Phase 1: Environment Setup & Azure OpenAI Verification
- Phase 2: Running Example 1 - Circle Packing
- Phase 3: Running Example 2 - Novelty Generator
- Phase 4: Running Example 3 - Agent Design (ADAS AIME)
- Phase 5: Creating Your Own Evolution Experiment
- SOP: Standard Operating Procedure for New Use Cases
- Troubleshooting
- Python 3.11+ installed
- `uv` package manager (recommended) or `pip`
- Git
- Azure OpenAI resource with deployments configured
- Azure AD service principal (tenant ID, client ID, client secret) OR
- Azure OpenAI API key
- Azure OpenAI endpoint URL
# Navigate to project directory
cd /Users/samcc/Documents/CodexProject/ShinkaEvolve
# Install using uv (recommended)
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e .
# OR using pip
pip install -e .

Edit the `.env` file in the project root:
# Minimum required configuration (choose ONE authentication method)
# Option 1: OAuth2 (Recommended for Production)
AZURE_TENANT_ID=your-tenant-id-here
AZURE_CLIENT_ID=your-client-id-here
AZURE_CLIENT_SECRET=your-client-secret-here
AZURE_API_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_API_VERSION=2024-02-01
# Option 2: API Key (Simple/Development)
AZURE_OPENAI_API_KEY=your-api-key-here
AZURE_API_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_API_VERSION=2024-02-01
# Model Deployment Mappings (IMPORTANT!)
# Map OpenAI model names to your Azure deployment names
AZURE_MODEL_DEPLOYMENTS={"gpt-4.1-mini": "your-gpt4-mini-deployment-name"}
AZURE_EMBEDDING_DEPLOYMENTS={"text-embedding-3-small": "your-embedding-deployment-name"}

Finding Your Deployment Names:
- Go to Azure Portal → Azure OpenAI Studio
- Click Deployments in left sidebar
- Copy the Deployment name column values
- Paste into the JSON mapping above
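Both mapping variables must be valid JSON strings; a stray quote or trailing comma is a common failure mode. A quick standalone sanity check you can run yourself, independent of Shinka (the deployment name here is the placeholder from the `.env` example above):

```python
import json
import os

# Placeholder value from the .env example above; replace with your own
os.environ.setdefault(
    "AZURE_MODEL_DEPLOYMENTS",
    '{"gpt-4.1-mini": "your-gpt4-mini-deployment-name"}',
)
# json.loads raises ValueError here if the .env value is malformed JSON
mapping = json.loads(os.environ["AZURE_MODEL_DEPLOYMENTS"])
print(mapping["gpt-4.1-mini"])  # → your-gpt4-mini-deployment-name
```

If `json.loads` raises, fix the `.env` value before running anything else.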
Create a test file `test_azure.py`:
#!/usr/bin/env python3
"""Simple test to verify Azure OpenAI integration works."""
from shinka.llm import LLMClient
def test_llm_client():
"""Test basic LLM client functionality."""
print("🧪 Testing Azure OpenAI LLM Client...")
# Create client with your model
client = LLMClient(
model_names=["gpt-4.1-mini"], # Will map to your Azure deployment
temperatures=0.7,
max_tokens=100,
)
# Test query
result = client.query(
msg="Say 'Hello from Azure OpenAI!' and nothing else.",
system_msg="You are a helpful assistant.",
)
if result and result.content:
print(f"✅ SUCCESS! Response: {result.content}")
print(f"💰 Cost: ${result.cost:.4f}")
print(f"📊 Tokens: {result.input_tokens} in, {result.output_tokens} out")
return True
else:
print("❌ FAILED! No response received.")
return False
def test_embedding_client():
"""Test embedding client functionality."""
print("\n🧪 Testing Azure OpenAI Embedding Client...")
from shinka.llm import EmbeddingClient
client = EmbeddingClient(
model_name="text-embedding-3-small", # Will map to your deployment
)
# Test embedding
embedding, cost = client.get_embedding("Hello, Azure OpenAI!")
if embedding and len(embedding) > 0:
print(f"✅ SUCCESS! Embedding dimension: {len(embedding)}")
print(f"💰 Cost: ${cost:.6f}")
return True
else:
print("❌ FAILED! No embedding received.")
return False
if __name__ == "__main__":
print("=" * 60)
print("AZURE OPENAI INTEGRATION TEST")
print("=" * 60)
llm_success = test_llm_client()
embed_success = test_embedding_client()
print("\n" + "=" * 60)
if llm_success and embed_success:
print("🎉 ALL TESTS PASSED! Azure OpenAI is configured correctly.")
print("You can now proceed to running examples.")
else:
print("⚠️ Some tests failed. Check your .env configuration.")
print("See CLAUDE.md for detailed troubleshooting steps.")
print("=" * 60)

Run the test:
python test_azure.py

Expected Output:
============================================================
AZURE OPENAI INTEGRATION TEST
============================================================
🧪 Testing Azure OpenAI LLM Client...
✅ SUCCESS! Response: Hello from Azure OpenAI!
💰 Cost: $0.0012
📊 Tokens: 45 in, 8 out
🧪 Testing Azure OpenAI Embedding Client...
✅ SUCCESS! Embedding dimension: 1536
💰 Cost: $0.000002
============================================================
🎉 ALL TESTS PASSED! Azure OpenAI is configured correctly.
You can now proceed to running examples.
============================================================
If tests fail: See Troubleshooting section.
Objective: Evolve code to pack 26 circles in a unit square, maximizing the sum of their radii.
Difficulty: ⭐⭐ (Beginner)
What you'll learn:
- Basic evolution loop
- How EVOLVE-BLOCK markers work
- Understanding evaluation scripts
- Monitoring evolution progress
- Using the WebUI for visualization
The goal is to find the best arrangement of 26 circles in a 1x1 square such that:
- All circles are inside the square
- No circles overlap
- The sum of radii is maximized (best known: ~2.635)
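As a sketch of what these constraints mean in code (an illustrative re-implementation, not Shinka's actual validator), the two checks look like this:

```python
import math

def is_valid_packing(centers, radii, tol=1e-9):
    """Illustrative validity check: circles inside the unit square, no overlaps."""
    for (x, y), r in zip(centers, radii):
        # Every circle must lie fully inside the 1x1 square
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return False
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            # Centers must be at least the sum of radii apart
            if math.hypot(dx, dy) < radii[i] + radii[j] - tol:
                return False
    return True

# Two quarter-radius circles in opposite corners: valid
print(is_valid_packing([(0.25, 0.25), (0.75, 0.75)], [0.25, 0.25]))  # → True
```

The evolved code only proposes arrangements; the evaluator enforces these rules.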
Open examples/circle_packing/initial.py and look for:
# EVOLVE-BLOCK-START
def construct_packing():
"""Construct arrangement of 26 circles"""
# This code will be evolved by LLMs
# ...
# EVOLVE-BLOCK-END

Key Points:
- Code between `EVOLVE-BLOCK-START`/`END` will be modified by LLMs
- Code outside these markers stays unchanged
- Initial solution is intentionally simple (LLMs will improve it)
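To see how the markers delimit the editable region, here is a hypothetical extraction with a regex (Shinka's own parser may differ in details — this just illustrates the convention):

```python
import re

# Hypothetical miniature source file using the marker convention
SOURCE = """\
# EVOLVE-BLOCK-START
def my_algorithm(x):
    return x
# EVOLVE-BLOCK-END
def helper():
    pass
"""

# Everything between the markers is what the LLM is allowed to rewrite
match = re.search(
    r"# EVOLVE-BLOCK-START\n(.*?)# EVOLVE-BLOCK-END", SOURCE, re.DOTALL
)
print(match.group(1))
```

Note that `helper()` sits outside the markers, so it would never be touched.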
Open examples/circle_packing/evaluate.py:
def adapted_validate_packing(run_output):
"""Check if circles are valid (no overlap, inside square)"""
centers, radii, reported_sum = run_output
# Returns (is_valid: bool, error_message: str or None)
# ...
def aggregate_metrics(results, results_dir):
"""Compute fitness score"""
return {
"combined_score": float(reported_sum), # Higher = better
"public": {...}, # Visible in logs/WebUI
"private": {...}, # Internal use only
}

Key Points:
- `validate_fn`: Ensures solutions are correct (no overlaps, in bounds)
- `aggregate_fn`: Computes the fitness score (`combined_score`)
- Higher `combined_score` = better solution (evolution maximizes this)
Edit examples/circle_packing/run_evo.py:
# Minimal configuration for testing
db_config = DatabaseConfig(
num_islands=2, # Start with 2 islands
archive_size=20, # Keep top 20 solutions
)
evo_config = EvolutionConfig(
num_generations=5, # Start small (5 generations)
max_parallel_jobs=1, # 1 job at a time
llm_models=["gpt-4.1-mini"], # Your Azure deployment
init_program_path="initial.py",
)

cd examples/circle_packing
python run_evo.py

Expected Output:
[SHINKA LOGO]
==> GENERATION 1/5
==> SAMPLING: gpt-4.1-mini
==> PATCH APPLIED: gen_1_patch_0
==> EVALUATING: gen_1_patch_0
==> RESULT: combined_score=1.234, valid=True
...
==> GENERATION 5/5 COMPLETE
Best solution: combined_score=2.145 (improved from 1.023!)
In another terminal:
cd /Users/samcc/Documents/CodexProject/ShinkaEvolve
shinka_visualize --port 8888 --open

This opens a browser showing:
- Evolution progress in real-time
- Genealogy tree (which solutions evolved from which)
- Metrics over generations
- Code diffs between generations
After evolution completes, check the results directory:
ls -lah results_*
# Contains:
# - evolution_db.sqlite (database with all solutions)
# - gen_1/, gen_2/, ... (code for each generation)
# - best_solution.py (best evolved code)

Open the best solution:
cat results_*/best_solution.py
# See how the LLM improved the code!

What to Look For:
- Did the `combined_score` improve over generations?
- What strategies did the LLM try? (check different generation folders)
- How does the best solution differ from the initial one?
Objective: Generate creative, novel ASCII art or text patterns.
Difficulty: ⭐⭐⭐ (Intermediate)
What you'll learn:
- Novelty-based evolution (not just optimization)
- Using LLM judges for evaluation
- Open-ended exploration
- Text feedback in evolution loop
Unlike circle packing (maximize score), novelty search aims to find diverse and creative solutions. Each solution is judged on:
- Novelty: How different is it from previous solutions?
- Quality: How interesting/surprising is it?
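Novelty is typically measured in embedding space. A minimal sketch of the idea behind the `code_embed_sim_threshold` setting used below (illustrative only, not Shinka's implementation): a candidate counts as novel if its embedding is not too similar to anything already archived.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_novel(candidate, archive, threshold=0.8):
    # Novel only if no archived embedding is too similar to the candidate
    return all(cosine_sim(candidate, e) < threshold for e in archive)

print(is_novel([1.0, 0.0], [[0.0, 1.0]]))  # → True (orthogonal vectors)
print(is_novel([1.0, 0.1], [[1.0, 0.0]]))  # → False (nearly parallel)
```

In practice the embeddings come from the configured embedding model, and candidates that fail the check are retried (see `max_novelty_attempts` below).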
cat examples/novelty_generator/initial.py

Look for the EVOLVE-BLOCK that generates patterns.
Create examples/novelty_generator/run_evo.py:
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig
job_config = LocalJobConfig(eval_program_path="evaluate.py")
db_config = DatabaseConfig(
num_islands=2,
archive_size=30, # Keep more diverse solutions
)
evo_config = EvolutionConfig(
num_generations=10,
max_parallel_jobs=1,
llm_models=["gpt-4.1-mini"],
init_program_path="initial.py",
# Novelty-specific settings
code_embed_sim_threshold=0.8, # Similarity threshold
max_novelty_attempts=3, # Retries for novel solutions
novelty_llm_models=["gpt-4.1-mini"], # LLM judges
use_text_feedback=True, # Use text feedback in evolution
)
runner = EvolutionRunner(
evo_config=evo_config,
job_config=job_config,
db_config=db_config,
)
runner.run()

cd examples/novelty_generator
python run_evo.py

Expected Behavior:
- LLM generates diverse ASCII art patterns
- Each generation tries something different
- LLM judge evaluates creativity/novelty
- Archive fills with diverse solutions (not just "best" one)
# Check different evolved patterns
cat results_*/gen_*/evolved_*.py
# See the archive of diverse solutions
sqlite3 results_*/evolution_db.sqlite "SELECT id, score, novelty_score FROM programs ORDER BY novelty_score DESC LIMIT 10;"

What to Notice:
- Solutions are diverse (not converging to one pattern)
- Novelty scores vary
- Text feedback helps guide evolution toward interesting directions
Objective: Evolve agent scaffolds to solve math competition problems (AIME dataset).
Difficulty: ⭐⭐⭐⭐ (Advanced)
What you'll learn:
- Complex multi-step evaluation
- Agent scaffolding evolution
- Working with external datasets
- Multi-metric optimization
AIME (American Invitational Mathematics Examination) problems are challenging math problems. The goal is to evolve an agent scaffold (prompt templates, reasoning strategies) that helps solve these problems.
head examples/adas_aime/AIME_Dataset_1983_2025.csv

Each row contains:
- Problem text
- Correct answer
- Year/difficulty
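Before running the example, it can help to load a row or two yourself. A sketch with fabricated sample data; the real CSV's column names may differ, so inspect the header with `head` first:

```python
import csv
import io

# Hypothetical sample mirroring the fields listed above (problem text,
# correct answer, year); the real file's column names may differ.
sample = io.StringIO(
    "problem,answer,year\n"
    '"Find the remainder when 2^10 is divided by 7.",2,1984\n'
)
rows = list(csv.DictReader(sample))
print(rows[0]["year"], "->", rows[0]["answer"])  # → 1984 -> 2
```

For the real dataset, replace the `StringIO` with `open("AIME_Dataset_1983_2025.csv")` and the column names you find in its header.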
cat examples/adas_aime/initial.py

Look for agent components:
- Prompt templates
- Reasoning strategies
- Answer extraction methods
cd examples/adas_aime
python run_evo.py

Note: This example requires more compute (solving math problems is expensive). Consider:
- Reducing dataset size in `evaluate.py`
- Using a faster model: `gpt-4.1-mini` instead of `gpt-4.1`
- Increasing `max_parallel_jobs` if you have quota
This evolution takes longer. Watch for:
- Accuracy improvements over generations
- Different reasoning strategies tried by LLMs
- Prompt engineering evolution (how agent asks itself questions)
# Check current best accuracy
tail -f results_*/evolution.log | grep "combined_score"

ls examples/adas_aime/discovered/
# Review successful agent strategies
cat examples/adas_aime/discovered/1_gen15_cot_and_majority_voting_repackaged.py

What to Learn:
- What strategies emerged? (Chain-of-thought? Self-reflection? Voting?)
- How did prompts evolve over generations?
- Which strategies generalize best to unseen problems?
Objective: Apply Shinka to your own optimization problem.
Difficulty: ⭐⭐⭐⭐⭐ (Expert)
Ask yourself:
- What am I optimizing? (e.g., algorithm speed, solution quality, creative output)
- How do I measure success? (fitness function)
- What constraints exist? (e.g., correctness requirements)
- Is it optimization or novelty search?
Example Problems:
- Optimize a sorting algorithm for specific data patterns
- Generate creative product descriptions
- Evolve hyperparameters for a ML model
- Design API architectures
- Generate test cases for edge cases
mkdir -p examples/my_experiment
cd examples/my_experiment
touch initial.py evaluate.py run_evo.py

# EVOLVE-BLOCK-START
def my_algorithm(input_data):
"""
Your starting algorithm implementation.
This will be evolved by LLMs.
Args:
input_data: Your problem input
Returns:
result: Your problem output
"""
# Simple/naive implementation here
# LLMs will improve this!
result = simple_solution(input_data)
return result
# EVOLVE-BLOCK-END
def run_experiment(**kwargs):
"""
Entry point called by evaluation script.
Args:
**kwargs: Parameters from get_experiment_kwargs
Returns:
Results that will be validated and scored
"""
input_data = kwargs.get("input_data")
result = my_algorithm(input_data)
return result
# Helper functions (outside EVOLVE-BLOCK - won't be evolved)
def simple_solution(data):
# Your baseline implementation
return data

Key Rules:
- Put code to evolve inside `EVOLVE-BLOCK-START`/`END`
- Keep helper functions outside (they won't change)
- Implement `run_experiment(**kwargs)` as the entry point
- Return results in the format expected by the validator

Next, write `evaluate.py`:
#!/usr/bin/env python3
import argparse
from shinka.core import run_shinka_eval
def get_experiment_kwargs(run_idx: int) -> dict:
"""
Generate kwargs for each evaluation run.
Args:
run_idx: Index of this run (0, 1, 2, ...)
Returns:
Dict of kwargs to pass to run_experiment
"""
# Example: Load different test cases
test_cases = [
{"input_data": [1, 2, 3]},
{"input_data": [5, 4, 3, 2, 1]},
{"input_data": [10, 20, 30]},
]
return test_cases[run_idx % len(test_cases)]
def validate_solution(run_output):
"""
Check if solution is valid/correct.
Args:
run_output: Return value from run_experiment
Returns:
(is_valid: bool, error_msg: str or None)
"""
result = run_output
# Check correctness constraints
if result is None:
return False, "Result is None"
# Add your validation logic
if not meets_requirements(result):
return False, "Does not meet requirements"
return True, None
def aggregate_metrics(results: list, results_dir: str) -> dict:
"""
Compute fitness score from all runs.
Args:
results: List of run_output from all runs
results_dir: Directory to save extra data
Returns:
Dict with required keys:
- combined_score: float (PRIMARY FITNESS - higher=better)
- public: dict (visible in logs/WebUI)
- private: dict (internal use)
- text_feedback: str (optional, for LLM feedback)
"""
# Compute your fitness metric
scores = [compute_score(r) for r in results]
avg_score = sum(scores) / len(scores)
return {
"combined_score": float(avg_score), # MUST be present
"public": {
"average_score": avg_score,
"num_runs": len(results),
},
"private": {
"individual_scores": scores,
},
# Optional: guide LLM with text feedback
"text_feedback": f"Score: {avg_score:.2f}. Try improving X.",
}
def compute_score(result):
"""Your scoring logic"""
return 1.0 # Replace with actual score
def meets_requirements(result):
"""Your validation logic"""
return True # Replace with actual checks
def main(program_path: str, results_dir: str):
"""Main evaluation function called by Shinka"""
metrics, correct, error_msg = run_shinka_eval(
program_path=program_path,
results_dir=results_dir,
experiment_fn_name="run_experiment",
num_runs=3, # Run 3 times and aggregate
get_experiment_kwargs=get_experiment_kwargs,
validate_fn=validate_solution,
aggregate_metrics_fn=aggregate_metrics,
)
# Print results (optional, for debugging)
if correct:
print(f"✅ Valid solution: {metrics['combined_score']:.4f}")
else:
print(f"❌ Invalid solution: {error_msg}")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--program_path", type=str, required=True)
parser.add_argument("--results_dir", type=str, required=True)
args = parser.parse_args()
main(args.program_path, args.results_dir)

Now create `run_evo.py`:

#!/usr/bin/env python3
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig
# Configure job execution
job_config = LocalJobConfig(
eval_program_path="evaluate.py",
)
# Configure evolution database
db_config = DatabaseConfig(
num_islands=4,
archive_size=50,
migration_interval=10,
parent_selection_strategy="power_law",
exploitation_alpha=1.0,
)
# Task-specific guidance for LLM
task_description = """You are optimizing [DESCRIBE YOUR PROBLEM].
Key insights:
1. [Important strategy 1]
2. [Important strategy 2]
3. [Known pitfalls to avoid]
Focus on [SPECIFIC GOALS].
"""
# Configure evolution parameters
evo_config = EvolutionConfig(
num_generations=20,
max_parallel_jobs=2,
llm_models=["gpt-4.1-mini"], # Or multiple models
task_sys_msg=task_description,
init_program_path="initial.py",
patch_types=["diff", "full"],
patch_type_probs=[0.7, 0.3], # 70% diff, 30% full rewrites
)
# Run evolution
runner = EvolutionRunner(
evo_config=evo_config,
job_config=job_config,
db_config=db_config,
)
runner.run()

Before running full evolution, test each component:
# Test 1: Initial solution runs
python initial.py
# Should execute without errors
# Test 2: Evaluation script works
python evaluate.py --program_path initial.py --results_dir ./test_results
# Should output: ✅ Valid solution: X.XXXX
# Test 3: Quick evolution (1 generation)
# Edit run_evo.py: num_generations=1
python run_evo.py
# Should complete one evolution cycle

# Edit run_evo.py: num_generations=20 (or more)
python run_evo.py
# Monitor in another terminal
shinka_visualize --port 8888 --open

# Check evolution database
import sqlite3
from glob import glob

# connect() does not expand wildcards; resolve the glob first
conn = sqlite3.connect(glob("results_*/evolution_db.sqlite")[0])
# Get top solutions
top_solutions = conn.execute("""
SELECT id, combined_score, generation, island_id
FROM programs
WHERE is_correct = 1
ORDER BY combined_score DESC
LIMIT 10
""").fetchall()
for sol_id, score, gen, island in top_solutions:
print(f"Solution {sol_id}: score={score:.4f}, gen={gen}, island={island}")

Follow this checklist when starting a new evolution experiment:
- Define the problem clearly
  - What are you optimizing?
  - What's the input/output format?
  - What are the success criteria?
- Choose evolution type
  - Optimization (maximize/minimize score)
  - Novelty search (explore diverse solutions)
  - Multi-objective (balance multiple goals)
- Design fitness function
  - How do you score solutions? (higher = better)
  - What makes a solution "valid"?
  - Can you test automatically?
- Create project directory
  - `mkdir -p examples/my_project && cd examples/my_project`
- Write `initial.py`
  - Implement baseline algorithm
  - Mark evolution sections with `EVOLVE-BLOCK-START`/`END`
  - Implement `run_experiment(**kwargs)` entry point
  - Test: `python initial.py` runs without errors
- Write `evaluate.py`
  - Implement `get_experiment_kwargs(run_idx)`
  - Implement `validate_solution(run_output)`
  - Implement `aggregate_metrics(results, results_dir)`
  - Test: `python evaluate.py --program_path initial.py --results_dir ./test` works
- Write `run_evo.py`
  - Configure `DatabaseConfig` (islands, archive size)
  - Configure `EvolutionConfig` (generations, models, task description)
  - Configure `LocalJobConfig` (or Slurm for cluster)
  - Test: a run with `num_generations=1` completes
- Tune evolution parameters
  - Start small: `num_generations=5`, `num_islands=2`
  - Choose appropriate model(s): balance cost vs. performance
  - Write a good `task_sys_msg`: guide the LLM with domain knowledge
  - Set `patch_types`: `["diff"]` for incremental, `["full"]` for rewrites
- Configure Azure OpenAI
  - Update `.env` with model deployment mappings
  - Test with `test_azure.py` first
  - Monitor costs: check `combined_score` vs. API costs
- Run small test
  - `num_generations=1-3`
  - Verify evolution loop works
  - Check one solution improves
- Scale up gradually
  - `num_generations=10-20`
  - Monitor WebUI for progress
  - Check for convergence or diversity
- Full evolution run
  - `num_generations=50-100` (depending on problem)
  - Use `max_parallel_jobs` to speed up
  - Save results to version control
- Review best solutions
  - Check `results_*/best_solution.py`
  - Compare to initial: what strategies emerged?
  - Test generalization on held-out data
- Visualize evolution
  - Use WebUI to see genealogy tree
  - Plot `combined_score` over generations
  - Identify innovation moments (big jumps)
- Extract insights
  - What patterns/strategies did LLMs discover?
  - Which are human-interpretable?
  - Which generalize to new problems?
- Refine based on results
  - Update `task_sys_msg` with discovered insights
  - Adjust fitness function if needed
  - Try different parent selection strategies
- Experiment with variations
  - Different LLM models
  - Different island configurations
  - Different patch type distributions
Symptom: "Authentication failed" or "Unauthorized"
Solutions:
# Check OAuth2 credentials
python -c "import os; from dotenv import load_dotenv; load_dotenv(); print(f'Tenant: {os.getenv(\"AZURE_TENANT_ID\")}')"
# Verify service principal has role
az role assignment list --assignee $AZURE_CLIENT_ID
# Test API key fallback
export AZURE_OPENAI_API_KEY=your-key
python test_azure.py

Symptom: "The API deployment for this resource does not exist"
Solutions:
# List your actual deployments
az cognitiveservices account deployment list \
--name your-openai-resource \
--resource-group your-rg
# Update .env mapping
AZURE_MODEL_DEPLOYMENTS={"gpt-4.1-mini": "actual-deployment-name"}

Symptom: `combined_score` stays flat across generations
Solutions:
- Check task description: Is `task_sys_msg` clear and helpful?
- Check fitness function: Does it differentiate good vs. bad solutions?
- Check initial solution: Is it too good already (ceiling effect)?
- Try a different strategy: Change `parent_selection_strategy`
- Increase diversity: Use more islands, larger archive
Symptom: Most solutions marked `is_correct=False`
Solutions:
- Check validation logic: Is it too strict?
- Check EVOLVE-BLOCK scope: Are critical functions inside/outside?
- Add guardrails: Give the LLM clearer constraints in `task_sys_msg`
- Review failed solutions: What mistakes are common?
Symptom: Azure bill is higher than expected
Solutions:
- Use a smaller model: `gpt-4.1-mini` instead of `gpt-4.1`
- Reduce generations: Start with 10-20, not 100
- Reduce parallel jobs: Lower `max_parallel_jobs`
- Monitor cost per generation: Check the `$` amounts in logs
- Set budget alerts in the Azure Portal
Now that you've completed all phases:
- Explore advanced features:
  - Meta-recommendations (evolution of evolution)
  - Dynamic model selection
  - Multi-objective optimization
- Scale to clusters:
  - Configure Slurm for large-scale experiments
  - Use `SlurmCondaJobConfig` or `SlurmDockerJobConfig`
- Contribute discoveries:
  - Share successful strategies
  - Report interesting emergent behaviors
  - Submit PRs with new examples
- Join the community:
  - GitHub issues for questions
  - Share your results
Happy Evolving! 🧬
For detailed reference, see:
- CLAUDE.md - Technical reference
- docs/getting_started.md - Installation guide
- docs/configuration.md - Configuration options
- docs/webui.md - WebUI guide