A local LLM benchmark evaluation pipeline built with CrewAI and Ollama.
No API keys required — everything runs on-device via llama3.1:8b.
Six specialized agents collaborate in a ReAct loop to run inference, parse results, compute statistics, analyze per-question errors, compare against public baselines, and write a full evaluation report with an animated pipeline visualization.
| Pipeline Timeline GIF | ReAct Process GIF |
|---|---|
| ![Pipeline timeline](outputs/arc_challenge/pipeline_flow.gif) | ![ReAct process](outputs/visuals/react_process.gif) |
```
main.py ──→ Crew (sequential)
            │
            ├── 1. Benchmark Runner       → run_or_load_benchmark
            ├── 2. Benchmark Data Parser  → parse_results
            ├── 3. Statistical Analyst    → compute_stats
            ├── 4. QA Detail Reviewer     → generate_qa_detail
            ├── 5. Comparative Analyst    → get_bigbench_baselines
            └── 6. Report Publisher       → file_writer + generate_pipeline_gif
```
Each agent uses llama3.1:8b via Ollama (litellm bridge), temperature=0.
A stdout interceptor captures all tool calls and LLM outputs into a structured JSONL log in real time.
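The actual `pipeline_logger.py` is more elaborate, but the tee-into-JSONL idea can be sketched as follows (class name, event shape, and file handling here are illustrative, not the project's exact code):

```python
import io
import json
import time

class StdoutJSONLInterceptor(io.TextIOBase):
    """Sketch: forward stdout unchanged while logging each complete
    line as a timestamped JSON event (one object per line = JSONL)."""

    def __init__(self, real_stdout, log_path):
        self.real = real_stdout
        self.log = open(log_path, "a", encoding="utf-8")
        self.buf = ""  # holds a partial line between write() calls

    def write(self, text):
        self.real.write(text)  # live output is untouched
        self.buf += text
        while "\n" in self.buf:
            line, self.buf = self.buf.split("\n", 1)
            if line.strip():
                event = {"ts": time.time(), "raw": line}
                self.log.write(json.dumps(event) + "\n")
        return len(text)

    def flush(self):
        self.real.flush()
        self.log.flush()
```

Installing it would then be a single assignment, e.g. `sys.stdout = StdoutJSONLInterceptor(sys.stdout, "outputs/pipeline_log.jsonl")`, which is why no CrewAI internals need to change.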
| Benchmark | Dataset | Task | Random baseline |
|---|---|---|---|
| `causal_judgement` | `lukaemon/bbh` | Yes/No causal reasoning | 50% |
| `arc_challenge` | `allenai/ai2_arc` (ARC-Challenge) | 4-choice science QA | 25% |
Switch benchmarks by changing one line in `main.py`:

```python
BENCHMARK = "arc_challenge"  # or "causal_judgement"
```

**arc_challenge baselines:**

| Model | Accuracy | Source |
|---|---|---|
| Human | 91.8% | Clark et al. (2018) |
| GPT-4 | 90.4% | OpenAI (2023) |
| GPT-3.5-turbo | 79.9% | Public eval |
| PaLM 2 (540B) | 78.0% | Google (2023) |
| LLaMA-2-70B | 67.9% | Meta (2023) |
| llama3.1:8b (this run) | 76.7% | This project |
| LLaMA-3.1-8B (reported) | ~59.5% | Meta (2024) |
| Random baseline | 25.0% | — |
**causal_judgement baselines:**

| Model | Accuracy |
|---|---|
| GPT-4 | 67% |
| llama3.1:8b (this run) | 50% |
| Random baseline | 50% |
- Python 3.10+
- Ollama running locally with `llama3.1:8b` pulled:

```shell
ollama pull llama3.1:8b
ollama serve   # must be running on localhost:11434
```

Install dependencies:

```shell
pip install -r requirements.txt
```

Create a `.env` file (content can be empty; Ollama needs no key):

```shell
touch .env
```

Run the pipeline:

```shell
python main.py
```

To switch benchmark, edit `main.py` line 14:

```python
BENCHMARK = "arc_challenge"  # causal_judgement | arc_challenge
```

Each run produces a self-contained folder under `outputs/{benchmark}/`.
If data/{benchmark}/results.jsonl already exists, inference is skipped — only the analysis pipeline re-runs.
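The cache-or-run behavior can be sketched like this (function signature and record shape are illustrative; `run_inference` stands in for the actual Ollama inference step):

```python
import json
from pathlib import Path

def run_or_load_benchmark(benchmark, run_inference):
    """Sketch: reuse cached results.jsonl if present, otherwise
    run inference and persist one JSON record per line."""
    results_path = Path("data") / benchmark / "results.jsonl"
    if results_path.exists():
        # Cached run found: skip inference, reload prior results.
        with results_path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    results = run_inference(benchmark)
    results_path.parent.mkdir(parents=True, exist_ok=True)
    with results_path.open("w", encoding="utf-8") as f:
        for rec in results:
            f.write(json.dumps(rec) + "\n")
    return results
```

Deleting `data/{benchmark}/results.jsonl` is therefore enough to force a fresh inference run.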
```
benchmark_crew/
│
├── main.py               # Entry point — configure BENCHMARK here
├── agents.py             # 6 CrewAI agent definitions
├── tasks.py              # 6 task definitions with context chains
├── pipeline_logger.py    # Stdout interceptor → structured JSONL log
├── _viz_worker.py        # GIF renderer (runs in subprocess to avoid pydantic conflict)
├── make_react_gif.py     # Standalone: generate ReAct process animation
├── run_benchmark.py      # Legacy standalone inference script (superseded)
├── requirements.txt
│
├── tools/
│   ├── benchmark_tool.py # Run or load inference (causal_judgement / arc_challenge)
│   ├── stats_tool.py     # parse_results + compute_stats
│   ├── qa_tool.py        # Per-question detail table generator
│   ├── baseline_tool.py  # Static public baseline lookup
│   ├── viz_tool.py       # Calls _viz_worker.py as subprocess
│   └── json_tool.py      # Raw JSONL reader
│
├── data/
│   ├── causal_judgement/
│   │   └── results.jsonl # 20 samples, accuracy 50%
│   └── arc_challenge/
│       └── results.jsonl # 30 samples, accuracy 76.7%
│
└── outputs/
    ├── causal_judgement/
    │   ├── llama3.1_8b_report.md
    │   ├── pipeline_flow.gif
    │   ├── pipeline_log.jsonl
    │   ├── pipeline_summary.json
    │   └── qa_detail.md
    ├── arc_challenge/
    │   ├── llama3.1_8b_report.md
    │   ├── pipeline_flow.gif
    │   ├── pipeline_log.jsonl
    │   ├── pipeline_summary.json
    │   └── qa_detail.md
    └── visuals/
        └── react_process.gif
```
| File | Description |
|---|---|
| `llama3.1_8b_report.md` | Full evaluation report (summary, stats table, error analysis, baseline comparison) |
| `qa_detail.md` | Per-question breakdown: full question text, model raw reply, extracted answer, correct/wrong |
| `pipeline_flow.gif` | Animated swimlane timeline: agent bars, tool call markers, scrolling event log, stage results panel |
| `pipeline_log.jsonl` | Structured event log: `task_start`, `tool_call`, `tool_result`, `agent_final`, `task_end` |
| `pipeline_summary.json` | Run metadata: model, benchmark, elapsed time, output paths |
1. **Benchmark Runner** checks whether `data/{benchmark}/results.jsonl` exists. If not, it runs Ollama inference over the dataset (30 samples by default) and writes the results.
2. **Data Parser** reads the JSONL and reports record count, field names, category distribution, score stats, and sample records.
3. **Statistical Analyst** computes per-category accuracy, standard deviation, and pass@1, and lists error examples.
4. **QA Detail Reviewer** generates a per-question Markdown table comparing model answers to ground truth.
5. **Comparative Analyst** looks up published scores for the same benchmark and positions llama3.1:8b relative to them.
6. **Report Publisher** calls the GIF renderer (via subprocess), then writes the final Markdown report using `FileWriterTool`.
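The statistics in step 3 reduce to a per-category tally over the result records. A minimal sketch, assuming each record carries a `category` string and a boolean `correct` flag (field names are illustrative):

```python
from collections import defaultdict

def compute_stats(records):
    """Sketch: overall and per-category accuracy from result records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        cat = rec.get("category", "all")
        totals[cat] += 1
        hits[cat] += int(bool(rec["correct"]))
    overall = sum(hits.values()) / max(sum(totals.values()), 1)
    per_cat = {c: hits[c] / totals[c] for c in totals}
    return {"overall_accuracy": overall, "per_category": per_cat}
```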
The pipeline logger intercepts CrewAI's verbose stdout in real time, using regexes to parse agent names, tool calls, tool results, and final answers into timestamped JSONL events, all without modifying CrewAI internals.
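The regex classification step might look roughly like this; the patterns below are illustrative, since CrewAI's actual verbose line formats vary by version:

```python
import re

# Hypothetical patterns; real CrewAI verbose output differs by version.
TOOL_CALL_RE = re.compile(r"Using tool:\s*(?P<tool>\S+)")
AGENT_RE = re.compile(r"Agent:\s*(?P<agent>.+)")

def classify_line(line):
    """Sketch: map one captured stdout line to a structured event dict."""
    m = TOOL_CALL_RE.search(line)
    if m:
        return {"event": "tool_call", "tool": m.group("tool")}
    m = AGENT_RE.search(line)
    if m:
        return {"event": "agent", "agent": m.group("agent").strip()}
    return {"event": "raw", "text": line}  # fallback: keep the line verbatim
```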