
FORYOU.md - Step-by-Step Learning Guide for Shinka

This guide will take you from zero to hero with the ShinkaEvolve framework, starting with simple Azure OpenAI integration tests, progressing through examples, and ending with creating your own evolution experiments.


Table of Contents

  1. Prerequisites
  2. Phase 1: Environment Setup & Azure OpenAI Verification
  3. Phase 2: Running Example 1 - Circle Packing
  4. Phase 3: Running Example 2 - Novelty Generator
  5. Phase 4: Running Example 3 - Agent Design (ADAS AIME)
  6. Phase 5: Creating Your Own Evolution Experiment
  7. SOP: Standard Operating Procedure for New Use Cases
  8. Troubleshooting

Prerequisites

Required Software

  • Python 3.11+ installed
  • uv package manager (recommended) or pip
  • Git
  • Azure OpenAI resource with deployments configured

Required Credentials

  • Azure AD service principal (tenant ID, client ID, client secret) OR
  • Azure OpenAI API key
  • Azure OpenAI endpoint URL

Phase 1: Environment Setup & Azure OpenAI Verification

Step 1.1: Install Dependencies

# Navigate to your local clone of the project
cd /path/to/ShinkaEvolve

# Install using uv (recommended)
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e .

# OR using pip
pip install -e .

Step 1.2: Configure Azure OpenAI Credentials

Edit the .env file in the project root:

# Minimum required configuration (choose ONE authentication method)

# Option 1: OAuth2 (Recommended for Production)
AZURE_TENANT_ID=your-tenant-id-here
AZURE_CLIENT_ID=your-client-id-here
AZURE_CLIENT_SECRET=your-client-secret-here
AZURE_API_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_API_VERSION=2024-02-01

# Option 2: API Key (Simple/Development)
AZURE_OPENAI_API_KEY=your-api-key-here
AZURE_API_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_API_VERSION=2024-02-01

# Model Deployment Mappings (IMPORTANT!)
# Map OpenAI model names to your Azure deployment names
AZURE_MODEL_DEPLOYMENTS={"gpt-4.1-mini": "your-gpt4-mini-deployment-name"}
AZURE_EMBEDDING_DEPLOYMENTS={"text-embedding-3-small": "your-embedding-deployment-name"}

Finding Your Deployment Names:

  1. Go to Azure Portal → Azure OpenAI Studio
  2. Click Deployments in left sidebar
  3. Copy the Deployment name column values
  4. Paste into the JSON mapping above
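As a sanity check, the JSON mapping can be parsed the way any client would read it. A minimal sketch (an illustrative helper, not the Shinka internals; the fallback-to-model-name behavior is an assumption):

```python
import json
import os

def resolve_deployment(model_name, env_var="AZURE_MODEL_DEPLOYMENTS"):
    """Map an OpenAI model name to its Azure deployment name using the
    JSON mapping in the environment; fall back to the model name itself."""
    mapping = json.loads(os.getenv(env_var, "{}"))
    return mapping.get(model_name, model_name)
```

For example, with `AZURE_MODEL_DEPLOYMENTS={"gpt-4.1-mini": "my-gpt4-mini"}`, `resolve_deployment("gpt-4.1-mini")` yields `"my-gpt4-mini"`.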

Step 1.3: Test Azure OpenAI Integration (Simple Test)

Create a test file test_azure.py:

#!/usr/bin/env python3
"""Simple test to verify Azure OpenAI integration works."""

from shinka.llm import LLMClient

def test_llm_client():
    """Test basic LLM client functionality."""
    print("🧪 Testing Azure OpenAI LLM Client...")

    # Create client with your model
    client = LLMClient(
        model_names=["gpt-4.1-mini"],  # Will map to your Azure deployment
        temperatures=0.7,
        max_tokens=100,
    )

    # Test query
    result = client.query(
        msg="Say 'Hello from Azure OpenAI!' and nothing else.",
        system_msg="You are a helpful assistant.",
    )

    if result and result.content:
        print(f"✅ SUCCESS! Response: {result.content}")
        print(f"💰 Cost: ${result.cost:.4f}")
        print(f"📊 Tokens: {result.input_tokens} in, {result.output_tokens} out")
        return True
    else:
        print("❌ FAILED! No response received.")
        return False

def test_embedding_client():
    """Test embedding client functionality."""
    print("\n🧪 Testing Azure OpenAI Embedding Client...")

    from shinka.llm import EmbeddingClient

    client = EmbeddingClient(
        model_name="text-embedding-3-small",  # Will map to your deployment
    )

    # Test embedding
    embedding, cost = client.get_embedding("Hello, Azure OpenAI!")

    if embedding and len(embedding) > 0:
        print(f"✅ SUCCESS! Embedding dimension: {len(embedding)}")
        print(f"💰 Cost: ${cost:.6f}")
        return True
    else:
        print("❌ FAILED! No embedding received.")
        return False

if __name__ == "__main__":
    print("=" * 60)
    print("AZURE OPENAI INTEGRATION TEST")
    print("=" * 60)

    llm_success = test_llm_client()
    embed_success = test_embedding_client()

    print("\n" + "=" * 60)
    if llm_success and embed_success:
        print("🎉 ALL TESTS PASSED! Azure OpenAI is configured correctly.")
        print("You can now proceed to running examples.")
    else:
        print("⚠️  Some tests failed. Check your .env configuration.")
        print("See CLAUDE.md for detailed troubleshooting steps.")
    print("=" * 60)

Run the test:

python test_azure.py

Expected Output:

============================================================
AZURE OPENAI INTEGRATION TEST
============================================================
🧪 Testing Azure OpenAI LLM Client...
✅ SUCCESS! Response: Hello from Azure OpenAI!
💰 Cost: $0.0012
📊 Tokens: 45 in, 8 out

🧪 Testing Azure OpenAI Embedding Client...
✅ SUCCESS! Embedding dimension: 1536
💰 Cost: $0.000002

============================================================
🎉 ALL TESTS PASSED! Azure OpenAI is configured correctly.
You can now proceed to running examples.
============================================================

If tests fail: See Troubleshooting section.


Phase 2: Running Example 1 - Circle Packing

Objective: Evolve code to pack 26 circles in a unit square, maximizing the sum of their radii.

Difficulty: ⭐⭐ (Beginner)

What you'll learn:

  • Basic evolution loop
  • How EVOLVE-BLOCK markers work
  • Understanding evaluation scripts
  • Monitoring evolution progress
  • Using the WebUI for visualization

Step 2.1: Understand the Problem

The goal is to find the best arrangement of 26 circles in a 1x1 square such that:

  • All circles are inside the square
  • No circles overlap
  • The sum of radii is maximized (best known: ~2.635)
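These constraints translate directly into a validity check. A minimal sketch of what the evaluator must enforce (a hypothetical helper for intuition; the real check lives in evaluate.py):

```python
import math

def packing_is_valid(centers, radii, tol=1e-9):
    """Check the two hard constraints for a candidate packing:
    every circle inside the unit square, no pairwise overlap."""
    for (x, y), r in zip(centers, radii):
        # Circle must fit inside [0, 1] x [0, 1] in both axes.
        if x - r < -tol or y - r < -tol or x + r > 1 + tol or y + r > 1 + tol:
            return False
    for i in range(len(radii)):
        for j in range(i + 1, len(radii)):
            (x1, y1), (x2, y2) = centers[i], centers[j]
            # Centers must be at least the sum of radii apart.
            if math.hypot(x1 - x2, y1 - y2) < radii[i] + radii[j] - tol:
                return False
    return True
```

The fitness is then simply `sum(radii)` over packings that pass this check.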

Step 2.2: Review the Initial Solution

Open examples/circle_packing/initial.py and look for:

# EVOLVE-BLOCK-START
def construct_packing():
    """Construct arrangement of 26 circles"""
    # This code will be evolved by LLMs
    # ...
# EVOLVE-BLOCK-END

Key Points:

  • Code between EVOLVE-BLOCK-START/END will be modified by LLMs
  • Code outside these markers stays unchanged
  • Initial solution is intentionally simple (LLMs will improve it)
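To make the marker mechanics concrete, here is a hedged illustration of how such regions could be extracted (not the actual Shinka implementation, just the idea behind the markers):

```python
import re

def extract_evolve_blocks(source):
    """Return the code regions between EVOLVE-BLOCK-START/END markers.
    Everything outside these regions is left untouched by evolution."""
    pattern = re.compile(
        r"# EVOLVE-BLOCK-START\n(.*?)# EVOLVE-BLOCK-END", re.DOTALL
    )
    return pattern.findall(source)
```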

Step 2.3: Review the Evaluation Script

Open examples/circle_packing/evaluate.py:

def adapted_validate_packing(run_output):
    """Check if circles are valid (no overlap, inside square)"""
    centers, radii, reported_sum = run_output
    # Returns (is_valid: bool, error_message: str or None)
    # ...

def aggregate_metrics(results, results_dir):
    """Compute fitness score"""
    return {
        "combined_score": float(reported_sum),  # Higher = better
        "public": {...},   # Visible in logs/WebUI
        "private": {...},  # Internal use only
    }

Key Points:

  • validate_fn: Ensures solutions are correct (no overlaps, in bounds)
  • aggregate_fn: Computes the fitness score (combined_score)
  • Higher combined_score = better solution (evolution maximizes this)

Step 2.4: Configure Evolution Parameters

Edit examples/circle_packing/run_evo.py:

# Minimal configuration for testing
db_config = DatabaseConfig(
    num_islands=2,        # Start with 2 islands
    archive_size=20,      # Keep top 20 solutions
)

evo_config = EvolutionConfig(
    num_generations=5,    # Start small (5 generations)
    max_parallel_jobs=1,  # 1 job at a time
    llm_models=["gpt-4.1-mini"],  # Your Azure deployment
    init_program_path="initial.py",
)

Step 2.5: Run Evolution

cd examples/circle_packing
python run_evo.py

Expected Output:

[SHINKA LOGO]
==> GENERATION 1/5
==> SAMPLING: gpt-4.1-mini
==> PATCH APPLIED: gen_1_patch_0
==> EVALUATING: gen_1_patch_0
==> RESULT: combined_score=1.234, valid=True
...
==> GENERATION 5/5 COMPLETE
Best solution: combined_score=2.145 (improved from 1.023!)

Step 2.6: Monitor with WebUI (Optional)

In another terminal:

cd /path/to/ShinkaEvolve
shinka_visualize --port 8888 --open

This opens a browser showing:

  • Evolution progress in real-time
  • Genealogy tree (which solutions evolved from which)
  • Metrics over generations
  • Code diffs between generations

Step 2.7: Review Results

After evolution completes, check the results directory:

ls -lah results_*
# Contains:
# - evolution_db.sqlite (database with all solutions)
# - gen_1/, gen_2/, ... (code for each generation)
# - best_solution.py (best evolved code)

Open the best solution:

cat results_*/best_solution.py
# See how the LLM improved the code!

What to Look For:

  • Did the combined_score improve over generations?
  • What strategies did the LLM try? (check different generation folders)
  • How does the best solution differ from initial?

Phase 3: Running Example 2 - Novelty Generator

Objective: Generate creative, novel ASCII art or text patterns.

Difficulty: ⭐⭐⭐ (Intermediate)

What you'll learn:

  • Novelty-based evolution (not just optimization)
  • Using LLM judges for evaluation
  • Open-ended exploration
  • Text feedback in evolution loop

Step 3.1: Understand Novelty Search

Unlike circle packing (maximize score), novelty search aims to find diverse and creative solutions. Each solution is judged on:

  • Novelty: How different is it from previous solutions?
  • Quality: How interesting/surprising is it?
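The novelty gate in the config below (`code_embed_sim_threshold`) works on code embeddings. A hedged sketch of the idea, assuming a cosine-similarity gate against the archive (function and parameter names here are illustrative, not the Shinka API):

```python
import math

def is_novel(candidate_emb, archive_embs, sim_threshold=0.8):
    """Keep a candidate only if its maximum cosine similarity to every
    archived solution's embedding stays below the threshold."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    if not archive_embs:
        return True  # an empty archive makes everything novel
    return max(cosine(candidate_emb, e) for e in archive_embs) < sim_threshold
```

Lowering the threshold demands more dissimilar solutions; raising it accepts near-duplicates.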

Step 3.2: Review the Initial Code

cat examples/novelty_generator/initial.py

Look for the EVOLVE-BLOCK that generates patterns.

Step 3.3: Configure Novelty Evolution

Create examples/novelty_generator/run_evo.py:

from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig

job_config = LocalJobConfig(eval_program_path="evaluate.py")

db_config = DatabaseConfig(
    num_islands=2,
    archive_size=30,  # Keep more diverse solutions
)

evo_config = EvolutionConfig(
    num_generations=10,
    max_parallel_jobs=1,
    llm_models=["gpt-4.1-mini"],
    init_program_path="initial.py",

    # Novelty-specific settings
    code_embed_sim_threshold=0.8,  # Similarity threshold
    max_novelty_attempts=3,         # Retries for novel solutions
    novelty_llm_models=["gpt-4.1-mini"],  # LLM judges
    use_text_feedback=True,         # Use text feedback in evolution
)

runner = EvolutionRunner(
    evo_config=evo_config,
    job_config=job_config,
    db_config=db_config,
)
runner.run()

Step 3.4: Run Novelty Evolution

cd examples/novelty_generator
python run_evo.py

Expected Behavior:

  • LLM generates diverse ASCII art patterns
  • Each generation tries something different
  • LLM judge evaluates creativity/novelty
  • Archive fills with diverse solutions (not just "best" one)

Step 3.5: Review Diverse Solutions

# Check different evolved patterns
cat results_*/gen_*/evolved_*.py

# See the archive of diverse solutions
sqlite3 results_*/evolution_db.sqlite "SELECT id, score, novelty_score FROM programs ORDER BY novelty_score DESC LIMIT 10;"

What to Notice:

  • Solutions are diverse (not converging to one pattern)
  • Novelty scores vary
  • Text feedback helps guide evolution toward interesting directions

Phase 4: Running Example 3 - Agent Design (ADAS AIME)

Objective: Evolve agent scaffolds to solve math competition problems (AIME dataset).

Difficulty: ⭐⭐⭐⭐ (Advanced)

What you'll learn:

  • Complex multi-step evaluation
  • Agent scaffolding evolution
  • Working with external datasets
  • Multi-metric optimization

Step 4.1: Understand the Task

AIME (American Invitational Mathematics Examination) problems are challenging math problems. The goal is to evolve an agent scaffold (prompt templates, reasoning strategies) that helps solve these problems.

Step 4.2: Review the Dataset

head examples/adas_aime/AIME_Dataset_1983_2025.csv

Each row contains:

  • Problem text
  • Correct answer
  • Year/difficulty

Step 4.3: Review Initial Agent Scaffold

cat examples/adas_aime/initial.py

Look for agent components:

  • Prompt templates
  • Reasoning strategies
  • Answer extraction methods

Step 4.4: Run Agent Evolution

cd examples/adas_aime
python run_evo.py

Note: This example requires more compute (solving math problems is expensive). Consider:

  • Reducing dataset size in evaluate.py
  • Using faster model: gpt-4.1-mini instead of gpt-4.1
  • Increasing max_parallel_jobs if you have quota
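One cheap way to reduce cost is to subsample the dataset inside evaluate.py. A minimal sketch, assuming the CSV has a header row (column names and the exact layout of AIME_Dataset_1983_2025.csv are assumptions; adjust to the real headers):

```python
import csv
import random

def sample_problems(csv_path, n=10, seed=0):
    """Load the problem CSV and return a fixed-seed random subset,
    so evaluation stays cheap and reproducible during testing."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return random.Random(seed).sample(rows, min(n, len(rows)))
```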

Step 4.5: Monitor Progress

This evolution takes longer. Watch for:

  • Accuracy improvements over generations
  • Different reasoning strategies tried by LLMs
  • Prompt engineering evolution (how the agent asks itself questions)

# Check current best accuracy
tail -f results_*/evolution.log | grep "combined_score"

Step 4.6: Analyze Discovered Agents

ls examples/adas_aime/discovered/

# Review successful agent strategies
cat examples/adas_aime/discovered/1_gen15_cot_and_majority_voting_repackaged.py

What to Learn:

  • What strategies emerged? (Chain-of-thought? Self-reflection? Voting?)
  • How did prompts evolve over generations?
  • Which strategies generalize best to unseen problems?

Phase 5: Creating Your Own Evolution Experiment

Objective: Apply Shinka to your own optimization problem.

Difficulty: ⭐⭐⭐⭐⭐ (Expert)

Step 5.1: Define Your Problem

Ask yourself:

  1. What am I optimizing? (e.g., algorithm speed, solution quality, creative output)
  2. How do I measure success? (fitness function)
  3. What constraints exist? (e.g., correctness requirements)
  4. Is it optimization or novelty search?

Example Problems:

  • Optimize a sorting algorithm for specific data patterns
  • Generate creative product descriptions
  • Evolve hyperparameters for a ML model
  • Design API architectures
  • Generate test cases for edge cases

Step 5.2: Create Project Structure

mkdir -p examples/my_experiment
cd examples/my_experiment
touch initial.py evaluate.py run_evo.py

Step 5.3: Write Initial Solution (initial.py)

# EVOLVE-BLOCK-START
def my_algorithm(input_data):
    """
    Your starting algorithm implementation.
    This will be evolved by LLMs.

    Args:
        input_data: Your problem input

    Returns:
        result: Your problem output
    """
    # Simple/naive implementation here
    # LLMs will improve this!
    result = simple_solution(input_data)
    return result
# EVOLVE-BLOCK-END

def run_experiment(**kwargs):
    """
    Entry point called by evaluation script.

    Args:
        **kwargs: Parameters from get_experiment_kwargs

    Returns:
        Results that will be validated and scored
    """
    input_data = kwargs.get("input_data")
    result = my_algorithm(input_data)
    return result

# Helper functions (outside EVOLVE-BLOCK - won't be evolved)
def simple_solution(data):
    # Your baseline implementation
    return data

Key Rules:

  • Put code to evolve inside EVOLVE-BLOCK-START/END
  • Keep helper functions outside (they won't change)
  • Implement run_experiment(**kwargs) as entry point
  • Return results in format expected by validator

Step 5.4: Write Evaluation Script (evaluate.py)

#!/usr/bin/env python3
import argparse
from shinka.core import run_shinka_eval

def get_experiment_kwargs(run_idx: int) -> dict:
    """
    Generate kwargs for each evaluation run.

    Args:
        run_idx: Index of this run (0, 1, 2, ...)

    Returns:
        Dict of kwargs to pass to run_experiment
    """
    # Example: Load different test cases
    test_cases = [
        {"input_data": [1, 2, 3]},
        {"input_data": [5, 4, 3, 2, 1]},
        {"input_data": [10, 20, 30]},
    ]
    return test_cases[run_idx % len(test_cases)]

def validate_solution(run_output):
    """
    Check if solution is valid/correct.

    Args:
        run_output: Return value from run_experiment

    Returns:
        (is_valid: bool, error_msg: str or None)
    """
    result = run_output

    # Check correctness constraints
    if result is None:
        return False, "Result is None"

    # Add your validation logic
    if not meets_requirements(result):
        return False, "Does not meet requirements"

    return True, None

def aggregate_metrics(results: list, results_dir: str) -> dict:
    """
    Compute fitness score from all runs.

    Args:
        results: List of run_output from all runs
        results_dir: Directory to save extra data

    Returns:
        Dict with required keys:
        - combined_score: float (PRIMARY FITNESS - higher=better)
        - public: dict (visible in logs/WebUI)
        - private: dict (internal use)
        - text_feedback: str (optional, for LLM feedback)
    """
    # Compute your fitness metric
    scores = [compute_score(r) for r in results]
    avg_score = sum(scores) / len(scores)

    return {
        "combined_score": float(avg_score),  # MUST be present
        "public": {
            "average_score": avg_score,
            "num_runs": len(results),
        },
        "private": {
            "individual_scores": scores,
        },
        # Optional: guide LLM with text feedback
        "text_feedback": f"Score: {avg_score:.2f}. Try improving X.",
    }

def compute_score(result):
    """Your scoring logic"""
    return 1.0  # Replace with actual score

def meets_requirements(result):
    """Your validation logic"""
    return True  # Replace with actual checks

def main(program_path: str, results_dir: str):
    """Main evaluation function called by Shinka"""
    metrics, correct, error_msg = run_shinka_eval(
        program_path=program_path,
        results_dir=results_dir,
        experiment_fn_name="run_experiment",
        num_runs=3,  # Run 3 times and aggregate
        get_experiment_kwargs=get_experiment_kwargs,
        validate_fn=validate_solution,
        aggregate_metrics_fn=aggregate_metrics,
    )

    # Print results (optional, for debugging)
    if correct:
        print(f"✅ Valid solution: {metrics['combined_score']:.4f}")
    else:
        print(f"❌ Invalid solution: {error_msg}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--program_path", type=str, required=True)
    parser.add_argument("--results_dir", type=str, required=True)
    args = parser.parse_args()

    main(args.program_path, args.results_dir)

Step 5.5: Write Evolution Runner (run_evo.py)

#!/usr/bin/env python3
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig

# Configure job execution
job_config = LocalJobConfig(
    eval_program_path="evaluate.py",
)

# Configure evolution database
db_config = DatabaseConfig(
    num_islands=4,
    archive_size=50,
    migration_interval=10,
    parent_selection_strategy="power_law",
    exploitation_alpha=1.0,
)

# Task-specific guidance for LLM
task_description = """You are optimizing [DESCRIBE YOUR PROBLEM].

Key insights:
1. [Important strategy 1]
2. [Important strategy 2]
3. [Known pitfalls to avoid]

Focus on [SPECIFIC GOALS].
"""

# Configure evolution parameters
evo_config = EvolutionConfig(
    num_generations=20,
    max_parallel_jobs=2,
    llm_models=["gpt-4.1-mini"],  # Or multiple models
    task_sys_msg=task_description,
    init_program_path="initial.py",
    patch_types=["diff", "full"],
    patch_type_probs=[0.7, 0.3],  # 70% diff, 30% full rewrites
)

# Run evolution
runner = EvolutionRunner(
    evo_config=evo_config,
    job_config=job_config,
    db_config=db_config,
)
runner.run()

Step 5.6: Test Your Setup

Before running full evolution, test each component:

# Test 1: Initial solution runs
python initial.py
# Should execute without errors

# Test 2: Evaluation script works
python evaluate.py --program_path initial.py --results_dir ./test_results
# Should output: ✅ Valid solution: X.XXXX

# Test 3: Quick evolution (1 generation)
# Edit run_evo.py: num_generations=1
python run_evo.py
# Should complete one evolution cycle

Step 5.7: Run Full Evolution

# Edit run_evo.py: num_generations=20 (or more)
python run_evo.py

# Monitor in another terminal
shinka_visualize --port 8888 --open

Step 5.8: Analyze Results

# Check evolution database
import glob
import sqlite3

# sqlite3 does not expand shell globs; resolve the results path first
db_path = glob.glob("results_*/evolution_db.sqlite")[0]
conn = sqlite3.connect(db_path)

# Get top solutions
top_solutions = conn.execute("""
    SELECT id, combined_score, generation, island_id
    FROM programs
    WHERE is_correct = 1
    ORDER BY combined_score DESC
    LIMIT 10
""").fetchall()

for sol_id, score, gen, island in top_solutions:
    print(f"Solution {sol_id}: score={score:.4f}, gen={gen}, island={island}")

SOP: Standard Operating Procedure for New Use Cases

Follow this checklist when starting a new evolution experiment:

☑️ Planning Phase

  • Define the problem clearly

    • What are you optimizing?
    • What's the input/output format?
    • What are success criteria?
  • Choose evolution type

    • Optimization (maximize/minimize score)
    • Novelty search (explore diverse solutions)
    • Multi-objective (balance multiple goals)
  • Design fitness function

    • How do you score solutions? (higher = better)
    • What makes a solution "valid"?
    • Can you test automatically?

☑️ Implementation Phase

  • Create project directory

    mkdir -p examples/my_project
    cd examples/my_project
  • Write initial.py

    • Implement baseline algorithm
    • Mark evolution sections with EVOLVE-BLOCK-START/END
    • Implement run_experiment(**kwargs) entry point
    • Test: python initial.py runs without errors
  • Write evaluate.py

    • Implement get_experiment_kwargs(run_idx)
    • Implement validate_solution(run_output)
    • Implement aggregate_metrics(results, results_dir)
    • Test: python evaluate.py --program_path initial.py --results_dir ./test works
  • Write run_evo.py

    • Configure DatabaseConfig (islands, archive size)
    • Configure EvolutionConfig (generations, models, task description)
    • Configure LocalJobConfig (or Slurm for cluster)
    • Test: Run with num_generations=1 completes

☑️ Configuration Phase

  • Tune evolution parameters

    • Start small: num_generations=5, num_islands=2
    • Choose appropriate model(s): balance cost vs. performance
    • Write good task_sys_msg: guide LLM with domain knowledge
    • Set patch_types: ["diff"] for incremental, ["full"] for rewrites
  • Configure Azure OpenAI

    • Update .env with model deployment mappings
    • Test with test_azure.py first
    • Monitor costs: check combined_score vs. API costs

☑️ Execution Phase

  • Run small test

    • num_generations=1-3
    • Verify evolution loop works
    • Check one solution improves
  • Scale up gradually

    • num_generations=10-20
    • Monitor WebUI for progress
    • Check for convergence or diversity
  • Full evolution run

    • num_generations=50-100 (depending on problem)
    • Use max_parallel_jobs to speed up
    • Save results to version control

☑️ Analysis Phase

  • Review best solutions

    • Check results_*/best_solution.py
    • Compare to initial: what strategies emerged?
    • Test generalization on held-out data
  • Visualize evolution

    • Use WebUI to see genealogy tree
    • Plot combined_score over generations
    • Identify innovation moments (big jumps)
  • Extract insights

    • What patterns/strategies did LLMs discover?
    • Which are human-interpretable?
    • Which generalize to new problems?

☑️ Iteration Phase

  • Refine based on results

    • Update task_sys_msg with discovered insights
    • Adjust fitness function if needed
    • Try different parent selection strategies
  • Experiment with variations

    • Different LLM models
    • Different island configurations
    • Different patch type distributions

Troubleshooting

Azure OpenAI Connection Issues

Symptom: "Authentication failed" or "Unauthorized"

Solutions:

# Check OAuth2 credentials
python -c "import os; from dotenv import load_dotenv; load_dotenv(); print(f'Tenant: {os.getenv(\"AZURE_TENANT_ID\")}')"

# Verify service principal has role
az role assignment list --assignee $AZURE_CLIENT_ID

# Test API key fallback
export AZURE_OPENAI_API_KEY=your-key
python test_azure.py

Deployment Not Found

Symptom: "The API deployment for this resource does not exist"

Solutions:

# List your actual deployments
az cognitiveservices account deployment list \
  --name your-openai-resource \
  --resource-group your-rg

# Update .env mapping
AZURE_MODEL_DEPLOYMENTS={"gpt-4.1-mini": "actual-deployment-name"}

Evolution Not Improving

Symptom: combined_score stays flat across generations

Solutions:

  1. Check task description: Is task_sys_msg clear and helpful?
  2. Check fitness function: Does it differentiate good vs. bad solutions?
  3. Check initial solution: Is it too good already (ceiling effect)?
  4. Try different strategy: Change parent_selection_strategy
  5. Increase diversity: Use more islands, larger archive
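Before tuning anything, confirm the score really is flat. A quick sketch that pulls the best valid score per generation from the evolution database (column names assume the schema used in the queries earlier in this guide):

```python
import sqlite3

def best_score_per_generation(db_path):
    """Return (generation, best combined_score) pairs for valid programs,
    so a flat or rising trajectory is visible at a glance."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT generation, MAX(combined_score) FROM programs "
        "WHERE is_correct = 1 GROUP BY generation ORDER BY generation"
    ).fetchall()
    conn.close()
    return rows
```

If successive generations show the same best score, the fixes above (task description, fitness shaping, selection strategy) are the places to look.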

Evaluation Failures

Symptom: Most solutions marked is_correct=False

Solutions:

  1. Check validation logic: Is it too strict?
  2. Check EVOLVE-BLOCK scope: Are critical functions inside/outside?
  3. Add guardrails: Give LLM clearer constraints in task_sys_msg
  4. Review failed solutions: What mistakes are common?

High Costs

Symptom: Azure bill is higher than expected

Solutions:

  1. Use smaller model: gpt-4.1-mini instead of gpt-4.1
  2. Reduce generations: Start with 10-20, not 100
  3. Reduce parallel jobs: Lower max_parallel_jobs
  4. Monitor cost per generation: Check $ in logs
  5. Set budget alerts: In Azure Portal

Next Steps

Now that you've completed all phases:

  1. Explore advanced features:

    • Meta-recommendations (evolution of evolution)
    • Dynamic model selection
    • Multi-objective optimization
  2. Scale to clusters:

    • Configure Slurm for large-scale experiments
    • Use SlurmCondaJobConfig or SlurmDockerJobConfig
  3. Contribute discoveries:

    • Share successful strategies
    • Report interesting emergent behaviors
    • Submit PRs with new examples
  4. Join the community:

    • GitHub issues for questions
    • Share your results

Happy Evolving! 🧬

For detailed reference, see: