assets
writing_services
README.md
__init__.py
analyze_stats.py
create_benchmark_graph.py
hidden_logger.py
logger.py

Writing services eval

Writing a service is a Claude Code Skill that I am testing. The skill prompt sets out a number of requirements that must be followed to create a service the way I like, and this evaluation checks that the skill actually adheres to them.

Uses the Claude Code Agent SDK to simulate a Claude Code session
Experiment tracking is done through a logger service to measure the effect of prompt changes
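To make the experiment-tracking idea concrete, here is a minimal sketch of a logger that records each eval run against the commit that produced it, so pass rates can be compared across prompt changes. It is not the code in logger.py: the `runs` table, its columns, and the function names `open_db`, `log_run`, and `pass_rate` are all assumptions made for illustration.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a results database with one hypothetical `runs` table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        " id INTEGER PRIMARY KEY,"
        " commit_id TEXT NOT NULL,"   # revision of the skill prompt under test
        " model TEXT NOT NULL,"       # label of the model that was evaluated
        " passed INTEGER NOT NULL,"   # requirements satisfied in this run
        " total INTEGER NOT NULL)"    # requirements checked in this run
    )
    return conn


def log_run(conn, commit_id, model, passed, total):
    """Record the outcome of a single eval run."""
    conn.execute(
        "INSERT INTO runs (commit_id, model, passed, total) VALUES (?, ?, ?, ?)",
        (commit_id, model, passed, total),
    )
    conn.commit()


def pass_rate(conn, commit_id):
    """Return the aggregate pass rate for a commit, or None if it has no runs."""
    passed, total = conn.execute(
        "SELECT SUM(passed), SUM(total) FROM runs WHERE commit_id = ?",
        (commit_id,),
    ).fetchone()
    if not total:
        return None
    return passed / total
```

Keying every row by commit is what makes prompt changes comparable: two revisions of the skill prompt can be run many times each and then summarized with one query per commit.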

Insights

Haiku 4.5 is not as good as Sonnet 4.5 at instruction following for simple skills

GLM-4.7 is better than Haiku but not as good as Sonnet

Using OpenRouter's Claude Code Integration, I tested GLM-4.7 with the same Claude Code harness.

NOTE: I haven't changed the prompt for this; given that this is a different model family, it may require different prompting.

Cost to run GLM-4.7 on eval: $2.75

Tips

Running the eval

for i in {1..20}; do echo "=== Run $i/20 ===" && make run-services-eval && jj abandon -r @; done

If using an OpenRouter model, source .env.openrouterclaude first.
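For anyone who prefers to drive the loop from Python, the shell one-liner above can be sketched roughly as follows. It mirrors the same behavior: print a banner, run the eval, and abandon the working-copy change only if the eval succeeded. The function name `run_eval_repeatedly` and its `run` parameter (injectable for testing) are hypothetical and not part of the repository.

```python
import subprocess


def run_eval_repeatedly(runs=20, run=subprocess.run):
    """Run the eval `runs` times, discarding generated changes after each success.

    Returns the number of runs whose eval command exited with status 0.
    """
    succeeded = 0
    for i in range(1, runs + 1):
        print(f"=== Run {i}/{runs} ===")
        result = run(["make", "run-services-eval"])
        if result.returncode == 0:
            # Same as `&& jj abandon -r @`: only clean up after a successful eval.
            run(["jj", "abandon", "-r", "@"])
            succeeded += 1
    return succeeded
```

As in the shell version, a failed eval does not stop the loop; it only skips the cleanup step for that iteration.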

Generating a graph

uv run python jack-software/evals/create_benchmark_graph.py -o jack-software/evals/assets/benchmark_comparison_glm_4_7.png -r 24dd1ad:Haiku -r 11de541:GLM-4.7 -r 6cf39c8:Sonnet --db evals.db
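Each `-r` argument in the command above pairs a commit with the label to show on the graph, in the form COMMIT:LABEL. The sketch below shows one way such arguments could be parsed with argparse. It is an assumption about the interface, not the actual contents of create_benchmark_graph.py: the long option names `--output` and `--run` and the helper names `parse_run_spec` and `build_parser` are invented for illustration.

```python
import argparse


def parse_run_spec(spec):
    """Split a 'COMMIT:LABEL' string into a (commit, label) tuple."""
    commit, sep, label = spec.partition(":")
    if not sep or not commit or not label:
        raise argparse.ArgumentTypeError(f"expected COMMIT:LABEL, got {spec!r}")
    return commit, label


def build_parser():
    """Build a parser accepting the flags used in the command above."""
    parser = argparse.ArgumentParser(description="Plot benchmark results per commit.")
    parser.add_argument("-o", "--output", required=True, help="path of the PNG to write")
    parser.add_argument(
        "-r", "--run",
        action="append",          # allow the flag to be repeated, one per series
        type=parse_run_spec,
        default=[],
        help="COMMIT:LABEL pair to include in the graph",
    )
    parser.add_argument("--db", default="evals.db", help="SQLite database with results")
    return parser
```

Parsing the label alongside the commit keeps the graph legend readable (a model name rather than a hash) while the commit still identifies which stored runs to plot.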
