agent eval

Part of BlackRoad OS — Sovereign Computing for Everyone

agent eval is part of the BlackRoad OS ecosystem — a sovereign, distributed operating system built on edge computing, local AI, and mesh networking by BlackRoad OS, Inc.

About BlackRoad OS

BlackRoad OS is a sovereign computing platform that runs AI locally on your own hardware. No cloud dependencies. No API keys. No surveillance. Built by BlackRoad OS, Inc., a Delaware C-Corp founded in 2025.

Key Features

Local AI — Run LLMs on Raspberry Pi, Hailo-8, and commodity hardware
Mesh Networking — WireGuard VPN, NATS pub/sub, peer-to-peer communication
Edge Computing — 52 TOPS of AI acceleration across a Pi fleet
Self-Hosted Everything — Git, DNS, storage, CI/CD, chat — all sovereign
Zero Cloud Dependencies — Your data stays on your hardware

The BlackRoad Ecosystem

| Organization | Focus |
|---|---|
| BlackRoad OS | Core platform and applications |
| BlackRoad OS, Inc. | Corporate and enterprise |
| BlackRoad AI | Artificial intelligence and ML |
| BlackRoad Hardware | Edge hardware and IoT |
| BlackRoad Security | Cybersecurity and auditing |
| BlackRoad Quantum | Quantum computing research |
| BlackRoad Agents | Autonomous AI agents |
| BlackRoad Network | Mesh and distributed networking |
| BlackRoad Education | Learning and tutoring platforms |
| BlackRoad Labs | Research and experiments |
| BlackRoad Cloud | Self-hosted cloud infrastructure |
| BlackRoad Forge | Developer tools and utilities |

Links

Website: blackroad.io
Documentation: docs.blackroad.io
Chat: chat.blackroad.io
Search: search.blackroad.io

Evaluation framework for BlackRoad OS agents. Tests agent responses against a suite of test cases and scores accuracy, latency, and relevance.

What This Is

A Python evaluation harness that sends prompts to agents via Ollama, compares responses against expected keywords, and produces a scored report. Used to validate that agents respond correctly and consistently.
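As a rough illustration of the request step, the sketch below builds and sends a non-streaming call to Ollama's `/api/generate` endpoint using only the standard library. The helper names `build_request` and `ask` are hypothetical, and `eval.py` itself may issue the request differently (for example by shelling out to curl).

```python
import json
import urllib.request

DEFAULT_HOST = "http://localhost:11434"  # same default as the --host option


def build_request(prompt, model="llama3.2", host=DEFAULT_HOST):
    """Build a POST request for Ollama's /api/generate endpoint (hypothetical helper)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        host.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="llama3.2", host=DEFAULT_HOST, timeout=120):
    """Send the prompt and return the generated text; needs a running Ollama server."""
    request = build_request(prompt, model, host)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Setting `"stream": False` asks Ollama for a single JSON object instead of a stream of chunks, which keeps the response easy to score in one pass.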

Requirements

Python 3.6+
Ollama running locally (or a remote instance specified with `--host`)
curl

Usage

# Run all 20 test cases
python3 eval.py

# Use a specific model
python3 eval.py --model codellama

# Test only one agent
python3 eval.py --agent coder

# Save report to file and include response previews
python3 eval.py --output eval-report.json --verbose

# Custom test file
python3 eval.py --tests my_tests.json

Options

| Flag | Default | Description |
|---|---|---|
| `--tests` | test_cases.json | Path to test cases file |
| `--model` | llama3.2 | Ollama model to evaluate |
| `--host` | http://localhost:11434 | Ollama API endpoint |
| `--output` | stdout | Output file for JSON report |
| `--verbose` | false | Include response previews |
| `--agent` | all | Filter to specific agent |
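
For readers who want to see how these flags could map onto a parser, here is a minimal `argparse` sketch. It mirrors the flags and defaults listed above, but `build_parser` is a hypothetical name, and representing "stdout" and "all" as `None` is an assumption rather than a description of `eval.py`.

```python
import argparse


def build_parser():
    """Return a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Evaluate agents via Ollama")
    parser.add_argument("--tests", default="test_cases.json",
                        help="Path to test cases file")
    parser.add_argument("--model", default="llama3.2",
                        help="Ollama model to evaluate")
    parser.add_argument("--host", default="http://localhost:11434",
                        help="Ollama API endpoint")
    # Assumption: None means "write the JSON report to stdout".
    parser.add_argument("--output", default=None,
                        help="Output file for JSON report")
    parser.add_argument("--verbose", action="store_true",
                        help="Include response previews")
    # Assumption: None means "run the test cases for all agents".
    parser.add_argument("--agent", default=None,
                        help="Filter to specific agent")
    return parser
```

Calling `build_parser().parse_args(["--agent", "coder", "--verbose"])` would yield `agent="coder"` and `verbose=True`, with every other option left at its default.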

Test Case Format

[
  {
    "agent": "coder",
    "prompt": "Write a Python function that checks if a number is prime.",
    "expected_keywords": ["def", "prime", "return", "True", "False"]
  }
]
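
A custom test file can be checked before a run with a small loader like the one below. This is a sketch under assumptions: `parse_test_cases` and `REQUIRED_KEYS` are hypothetical names, and the real harness may validate its input differently or not at all.

```python
import json

# The three fields shown in the example test case above.
REQUIRED_KEYS = ("agent", "prompt", "expected_keywords")


def parse_test_cases(text, agent=None):
    """Parse a JSON array of test cases, optionally keeping one agent's cases."""
    cases = json.loads(text)
    if not isinstance(cases, list):
        raise ValueError("test file must contain a JSON array")
    for index, case in enumerate(cases):
        missing = [key for key in REQUIRED_KEYS if key not in case]
        if missing:
            raise ValueError(
                "case {} is missing: {}".format(index, ", ".join(missing))
            )
    if agent is not None:
        cases = [case for case in cases if case["agent"] == agent]
    return cases
```

Passing `agent="coder"` keeps only the cases whose `agent` field is `coder`, similar in spirit to the `--agent` filter.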

Scoring

Accuracy: Percentage of expected keywords found in the response (0-100%)
Relevance: Heuristic score based on response length, structure, and prompt overlap (0-100%)
Latency: Wall-clock time for the Ollama request in seconds
Pass/Fail: A test passes if keyword accuracy >= 50%
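
The accuracy, latency, and pass/fail rules above can be sketched in a few lines. The function names are hypothetical, and case-insensitive substring matching and a score of 0 for an empty keyword list are assumptions, since the exact matching rule is not stated here. The relevance heuristic is left out because its formula is not specified.

```python
import time


def keyword_accuracy(response, expected_keywords):
    """Percentage (0-100) of expected keywords that appear in the response."""
    if not expected_keywords:
        return 0.0  # assumption: no keywords means nothing to match
    text = response.lower()  # assumption: matching ignores case
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return 100.0 * hits / len(expected_keywords)


def passed(accuracy, threshold=50.0):
    """A test passes when keyword accuracy is at least the threshold."""
    return accuracy >= threshold


def timed(fn, *args, **kwargs):
    """Call fn and return (result, wall-clock seconds elapsed)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

With the example test case from the previous section, a response that contains `def`, `prime`, `return`, and `True` but not `False` matches 4 of 5 keywords and would pass.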

Included Test Cases

The bundled test_cases.json contains 20 test cases across 10 agents, covering code generation, research, math, writing, security, teaching, networking, DevOps, inference, and monitoring.

Part of BlackRoad-Agents. Remember the Road. Pave Tomorrow. Incorporated 2025.
