A local LLM benchmark evaluation pipeline built with CrewAI and Ollama.
No API keys required — everything runs on-device via llama3.1:8b.
Six specialized agents collaborate in a ReAct loop to run inference, parse results, compute statistics, analyze per-question errors, compare against public baselines, and write a full evaluation report with an animated pipeline visualization.
| Pipeline Timeline GIF | ReAct Process GIF |
|---|---|
| ![Pipeline timeline](outputs/arc_challenge/pipeline_flow.gif) | ![ReAct process](outputs/visuals/react_process.gif) |
```
main.py ──→ Crew (sequential)
            │
            ├── 1. Benchmark Runner       → run_or_load_benchmark
            ├── 2. Benchmark Data Parser  → parse_results
            ├── 3. Statistical Analyst    → compute_stats
            ├── 4. QA Detail Reviewer     → generate_qa_detail
            ├── 5. Comparative Analyst    → get_bigbench_baselines
            └── 6. Report Publisher       → file_writer + generate_pipeline_gif
```
Each agent uses llama3.1:8b via Ollama (litellm bridge), temperature=0.
A stdout interceptor captures all tool calls and LLM outputs into a structured JSONL log in real time.
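The actual `pipeline_logger.py` is more elaborate, but the tee-into-JSONL idea can be sketched as follows (class name, event shape, and file handling here are illustrative, not the project's exact code):

```python
import io
import json
import time

class StdoutJSONLInterceptor(io.TextIOBase):
    """Sketch: forward stdout unchanged while logging each complete
    line as a timestamped JSON event (one object per line = JSONL)."""

    def __init__(self, real_stdout, log_path):
        self.real = real_stdout
        self.log = open(log_path, "a", encoding="utf-8")
        self.buf = ""  # holds a partial line between write() calls

    def write(self, text):
        self.real.write(text)  # live output is untouched
        self.buf += text
        while "\n" in self.buf:
            line, self.buf = self.buf.split("\n", 1)
            if line.strip():
                event = {"ts": time.time(), "raw": line}
                self.log.write(json.dumps(event) + "\n")
        return len(text)

    def flush(self):
        self.real.flush()
        self.log.flush()
```

Installing it would then be a single assignment, e.g. `sys.stdout = StdoutJSONLInterceptor(sys.stdout, "outputs/pipeline_log.jsonl")`, which is why no CrewAI internals need to change.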
| Benchmark | Dataset | Task | Random baseline |
|---|---|---|---|
| `causal_judgement` | `lukaemon/bbh` | Yes/No causal reasoning | 50% |
| `arc_challenge` | `allenai/ai2_arc` (ARC-Challenge) | 4-choice science QA | 25% |
Switch benchmarks by changing one line in `main.py`:

```python
BENCHMARK = "arc_challenge"  # or "causal_judgement"
```

**arc_challenge baselines:**

| Model | Accuracy | Source |
|---|---|---|
| Human | 91.8% | Clark et al. (2018) |
| GPT-4 | 90.4% | OpenAI (2023) |
| GPT-3.5-turbo | 79.9% | Public eval |
| PaLM 2 (540B) | 78.0% | Google (2023) |
| LLaMA-2-70B | 67.9% | Meta (2023) |
| llama3.1:8b (this run) | 76.7% | This project |
| LLaMA-3.1-8B (reported) | ~59.5% | Meta (2024) |
| Random baseline | 25.0% | — |
**causal_judgement baselines:**

| Model | Accuracy |
|---|---|
| GPT-4 | 67% |
| llama3.1:8b (this run) | 50% |
| Random baseline | 50% |
- Python 3.10+
- Ollama running locally with `llama3.1:8b` pulled:

```shell
ollama pull llama3.1:8b
ollama serve   # must be running on localhost:11434
```

Install dependencies:

```shell
pip install -r requirements.txt
```

Create a `.env` file (content can be empty; Ollama needs no key):

```shell
touch .env
```

Run the pipeline:

```shell
python main.py
```

To switch benchmark, edit `main.py` line 14:

```python
BENCHMARK = "arc_challenge"  # causal_judgement | arc_challenge
```

Each run produces a self-contained folder under `outputs/{benchmark}/`.
If data/{benchmark}/results.jsonl already exists, inference is skipped — only the analysis pipeline re-runs.
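The cache-or-run behavior can be sketched like this (function signature and record shape are illustrative; `run_inference` stands in for the actual Ollama inference step):

```python
import json
from pathlib import Path

def run_or_load_benchmark(benchmark, run_inference):
    """Sketch: reuse cached results.jsonl if present, otherwise
    run inference and persist one JSON record per line."""
    results_path = Path("data") / benchmark / "results.jsonl"
    if results_path.exists():
        # Cached run found: skip inference, reload prior results.
        with results_path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    results = run_inference(benchmark)
    results_path.parent.mkdir(parents=True, exist_ok=True)
    with results_path.open("w", encoding="utf-8") as f:
        for rec in results:
            f.write(json.dumps(rec) + "\n")
    return results
```

Deleting `data/{benchmark}/results.jsonl` is therefore enough to force a fresh inference run.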
```
benchmark_crew/
│
├── main.py               # Entry point — configure BENCHMARK here
├── agents.py             # 6 CrewAI agent definitions
├── tasks.py              # 6 task definitions with context chains
├── pipeline_logger.py    # Stdout interceptor → structured JSONL log
├── _viz_worker.py        # GIF renderer (runs in subprocess to avoid pydantic conflict)
├── make_react_gif.py     # Standalone: generate ReAct process animation
├── run_benchmark.py      # Legacy standalone inference script (superseded)
├── requirements.txt
│
├── tools/
│   ├── benchmark_tool.py # Run or load inference (causal_judgement / arc_challenge)
│   ├── stats_tool.py     # parse_results + compute_stats
│   ├── qa_tool.py        # Per-question detail table generator
│   ├── baseline_tool.py  # Static public baseline lookup
│   ├── viz_tool.py       # Calls _viz_worker.py as subprocess
│   └── json_tool.py      # Raw JSONL reader
│
├── data/
│   ├── causal_judgement/
│   │   └── results.jsonl # 20 samples, accuracy 50%
│   └── arc_challenge/
│       └── results.jsonl # 30 samples, accuracy 76.7%
│
└── outputs/
    ├── causal_judgement/
    │   ├── llama3.1_8b_report.md
    │   ├── pipeline_flow.gif
    │   ├── pipeline_log.jsonl
    │   ├── pipeline_summary.json
    │   └── qa_detail.md
    ├── arc_challenge/
    │   ├── llama3.1_8b_report.md
    │   ├── pipeline_flow.gif
    │   ├── pipeline_log.jsonl
    │   ├── pipeline_summary.json
    │   └── qa_detail.md
    └── visuals/
        └── react_process.gif
```
| File | Description |
|---|---|
| `llama3.1_8b_report.md` | Full evaluation report (summary, stats table, error analysis, baseline comparison) |
| `qa_detail.md` | Per-question breakdown: full question text, model raw reply, extracted answer, correct/wrong |
| `pipeline_flow.gif` | Animated swimlane timeline: agent bars, tool call markers, scrolling event log, stage results panel |
| `pipeline_log.jsonl` | Structured event log: `task_start`, `tool_call`, `tool_result`, `agent_final`, `task_end` |
| `pipeline_summary.json` | Run metadata: model, benchmark, elapsed time, output paths |
1. **Benchmark Runner** checks whether `data/{benchmark}/results.jsonl` exists. If not, it runs Ollama inference over the dataset (30 samples by default) and writes the results.
2. **Data Parser** reads the JSONL and reports record count, field names, category distribution, score stats, and sample records.
3. **Statistical Analyst** computes per-category accuracy, standard deviation, and pass@1, and lists error examples.
4. **QA Detail Reviewer** generates a per-question Markdown table comparing model answers to ground truth.
5. **Comparative Analyst** looks up published scores for the same benchmark and positions llama3.1:8b relative to them.
6. **Report Publisher** calls the GIF renderer (via subprocess), then writes the final Markdown report using `FileWriterTool`.
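The statistics in step 3 reduce to a per-category tally over the result records. A minimal sketch, assuming each record carries a `category` string and a boolean `correct` flag (field names are illustrative):

```python
from collections import defaultdict

def compute_stats(records):
    """Sketch: overall and per-category accuracy from result records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        cat = rec.get("category", "all")
        totals[cat] += 1
        hits[cat] += int(bool(rec["correct"]))
    overall = sum(hits.values()) / max(sum(totals.values()), 1)
    per_cat = {c: hits[c] / totals[c] for c in totals}
    return {"overall_accuracy": overall, "per_category": per_cat}
```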
The pipeline logger intercepts CrewAI's verbose stdout in real time, using regexes to parse agent names, tool calls, tool results, and final answers into timestamped JSONL events, all without modifying CrewAI internals.
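The regex classification step might look roughly like this; the patterns below are illustrative, since CrewAI's actual verbose line formats vary by version:

```python
import re

# Hypothetical patterns; real CrewAI verbose output differs by version.
TOOL_CALL_RE = re.compile(r"Using tool:\s*(?P<tool>\S+)")
AGENT_RE = re.compile(r"Agent:\s*(?P<agent>.+)")

def classify_line(line):
    """Sketch: map one captured stdout line to a structured event dict."""
    m = TOOL_CALL_RE.search(line)
    if m:
        return {"event": "tool_call", "tool": m.group("tool")}
    m = AGENT_RE.search(line)
    if m:
        return {"event": "agent", "agent": m.group("agent").strip()}
    return {"event": "raw", "text": line}  # fallback: keep the line verbatim
```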