
benchmark_crew

A local LLM benchmark evaluation pipeline built with CrewAI and Ollama. No API keys required — everything runs on-device via llama3.1:8b.

Six specialized agents collaborate in a ReAct loop to run inference, parse results, compute statistics, analyze per-question errors, compare against public baselines, and write a full evaluation report with an animated pipeline visualization.


Demo

Pipeline timeline: outputs/arc_challenge/pipeline_flow.gif
ReAct process: outputs/visuals/react_process.gif

Architecture

main.py  ──→  Crew (sequential)
              │
              ├── 1. Benchmark Runner       run_or_load_benchmark
              ├── 2. Benchmark Data Parser  parse_results
              ├── 3. Statistical Analyst    compute_stats
              ├── 4. QA Detail Reviewer     generate_qa_detail
              ├── 5. Comparative Analyst    get_bigbench_baselines
              └── 6. Report Publisher       file_writer + generate_pipeline_gif

Each agent uses llama3.1:8b via Ollama (litellm bridge), temperature=0. A stdout interceptor captures all tool calls and LLM outputs into a structured JSONL log in real time.
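The shared model settings can be sketched as a small helper. The names below (OLLAMA_BASE, MODEL, agent_llm_kwargs) are illustrative assumptions, not the repo's actual identifiers; the real wiring lives in agents.py:

```python
# Illustrative sketch of the model settings each agent would share.
# OLLAMA_BASE and MODEL are assumptions; see agents.py for the real config.
OLLAMA_BASE = "http://localhost:11434"
MODEL = "ollama/llama3.1:8b"  # litellm's provider/model naming convention

def agent_llm_kwargs(temperature: float = 0.0) -> dict:
    """Keyword arguments passed through the litellm bridge for every agent."""
    return {"model": MODEL, "api_base": OLLAMA_BASE, "temperature": temperature}
```

Temperature 0 keeps all six agents deterministic, which matters when comparing runs of the same benchmark.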


Supported Benchmarks

| Benchmark        | Dataset         | Task                              | Random baseline |
|------------------|-----------------|-----------------------------------|-----------------|
| causal_judgement | lukaemon/bbh    | Yes/No causal reasoning           | 50%             |
| arc_challenge    | allenai/ai2_arc | ARC-Challenge 4-choice science QA | 25%             |

Switch benchmarks by changing one line in main.py:

BENCHMARK = "arc_challenge"   # or "causal_judgement"

Results

ARC-Challenge (30 samples)

| Model                    | Accuracy | Source              |
|--------------------------|----------|---------------------|
| Human                    | 91.8%    | Clark et al. (2018) |
| GPT-4                    | 90.4%    | OpenAI (2023)       |
| GPT-3.5-turbo            | 79.9%    | Public eval         |
| PaLM 2 (540B)            | 78.0%    | Google (2023)       |
| LLaMA-2-70B              | 67.9%    | Meta (2023)         |
| llama3.1:8b (this run)   | 76.7%    | This project        |
| LLaMA-3.1-8B (reported)  | ~59.5%   | Meta (2024)         |
| Random baseline          | 25.0%    |                     |

causal_judgement (20 samples)

| Model                  | Accuracy |
|------------------------|----------|
| GPT-4                  | 67%      |
| llama3.1:8b (this run) | 50%      |
| Random baseline        | 50%      |
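Both runs use small samples (20 and 30 questions), so a single-run accuracy carries a wide error bar. A quick check with the standard Wilson score interval (textbook formula, not part of the repo) makes this concrete:

```python
import math

def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# 23/30 correct on ARC-Challenge ~= 76.7%
lo, hi = wilson_interval(23, 30)
print(f"{lo:.1%} to {hi:.1%}")
```

For 23/30 correct, the 95% interval spans roughly 59% to 88%, so the gap between this run's 76.7% and Meta's reported ~59.5% may partly reflect sampling noise rather than a genuine difference.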

Prerequisites

  • Python 3.10+
  • Ollama running locally with llama3.1:8b pulled:
    ollama pull llama3.1:8b
    ollama serve          # must be running on localhost:11434

Setup

pip install -r requirements.txt

Create a .env file (content can be empty — Ollama needs no key):

touch .env

Usage

python main.py

To switch benchmarks, edit the BENCHMARK constant in main.py (line 14):

BENCHMARK = "arc_challenge"     # causal_judgement | arc_challenge

Each run produces a self-contained folder under outputs/{benchmark}/. If data/{benchmark}/results.jsonl already exists, inference is skipped — only the analysis pipeline re-runs.
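The caching behaviour can be sketched as follows; run_inference is a hypothetical callable standing in for the real inference logic in tools/benchmark_tool.py:

```python
import json
from pathlib import Path

def run_or_load_benchmark(benchmark: str, run_inference) -> list[dict]:
    """Load cached results if present; otherwise run inference and cache them.

    `run_inference` is a hypothetical callable returning a list of per-sample
    dicts; the repo's actual tool lives in tools/benchmark_tool.py.
    """
    path = Path("data") / benchmark / "results.jsonl"
    if path.exists():
        # Cache hit: skip inference entirely, only the analysis re-runs.
        return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    results = run_inference(benchmark)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(json.dumps(r) for r in results) + "\n")
    return results
```

Deleting data/{benchmark}/results.jsonl is therefore the way to force a fresh inference run.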


Project Structure

benchmark_crew/
│
├── main.py                  # Entry point — configure BENCHMARK here
├── agents.py                # 6 CrewAI agent definitions
├── tasks.py                 # 6 task definitions with context chains
├── pipeline_logger.py       # Stdout interceptor → structured JSONL log
├── _viz_worker.py           # GIF renderer (runs in subprocess to avoid pydantic conflict)
├── make_react_gif.py        # Standalone: generate ReAct process animation
├── run_benchmark.py         # Legacy standalone inference script (superseded)
├── requirements.txt
│
├── tools/
│   ├── benchmark_tool.py    # Run or load inference (causal_judgement / arc_challenge)
│   ├── stats_tool.py        # parse_results + compute_stats
│   ├── qa_tool.py           # Per-question detail table generator
│   ├── baseline_tool.py     # Static public baseline lookup
│   ├── viz_tool.py          # Calls _viz_worker.py as subprocess
│   └── json_tool.py         # Raw JSONL reader
│
├── data/
│   ├── causal_judgement/
│   │   └── results.jsonl    # 20 samples, accuracy 50%
│   └── arc_challenge/
│       └── results.jsonl    # 30 samples, accuracy 76.7%
│
└── outputs/
    ├── causal_judgement/
    │   ├── llama3.1_8b_report.md
    │   ├── pipeline_flow.gif
    │   ├── pipeline_log.jsonl
    │   ├── pipeline_summary.json
    │   └── qa_detail.md
    ├── arc_challenge/
    │   ├── llama3.1_8b_report.md
    │   ├── pipeline_flow.gif
    │   ├── pipeline_log.jsonl
    │   ├── pipeline_summary.json
    │   └── qa_detail.md
    └── visuals/
        └── react_process.gif

Output Files

| File                  | Description |
|-----------------------|-------------|
| llama3.1_8b_report.md | Full evaluation report (summary, stats table, error analysis, baseline comparison) |
| qa_detail.md          | Per-question breakdown: full question text, model raw reply, extracted answer, correct/wrong |
| pipeline_flow.gif     | Animated swimlane timeline: agent bars, tool-call markers, scrolling event log, stage results panel |
| pipeline_log.jsonl    | Structured event log: task_start, tool_call, tool_result, agent_final, task_end |
| pipeline_summary.json | Run metadata: model, benchmark, elapsed time, output paths |
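Because pipeline_log.jsonl is one JSON object per line, it is easy to post-process. A small sketch that tallies events by type (the "event" field name matches the table above; any other fields are assumptions):

```python
import json
from collections import Counter

def summarize_log(path: str) -> Counter:
    """Count pipeline_log.jsonl events by type.

    Assumes each line is a JSON object with an "event" key
    (task_start, tool_call, tool_result, agent_final, task_end).
    """
    counts = Counter()
    with open(path) as f:
        for line in f:
            if line.strip():
                counts[json.loads(line)["event"]] += 1
    return counts
```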

How It Works

  1. Benchmark Runner checks if data/{benchmark}/results.jsonl exists. If not, runs Ollama inference over the dataset (30 samples by default) and writes results.
  2. Data Parser reads the JSONL and reports record count, field names, category distribution, score stats, and sample records.
  3. Statistical Analyst computes per-category accuracy, std, pass@1, and lists error examples.
  4. QA Detail Reviewer generates a per-question Markdown table comparing model answers to ground truth.
  5. Comparative Analyst looks up published scores for the same benchmark and positions llama3.1:8b relative to them.
  6. Report Publisher calls the GIF renderer (via subprocess) then writes the final Markdown report using FileWriterTool.
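The core of step 3 reduces to grouping results by category. A minimal sketch, assuming each result record carries "category" and "correct" fields (the repo's actual schema in tools/stats_tool.py may differ):

```python
from collections import defaultdict

def per_category_accuracy(records: list[dict]) -> dict[str, float]:
    """Accuracy per category; assumes 'category' and 'correct' keys per record."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        hits[r["category"]] += bool(r["correct"])
    return {c: hits[c] / totals[c] for c in totals}
```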

The pipeline logger intercepts crewai's verbose stdout in real time using regex, parsing agent names, tool calls, tool results, and final answers into timestamped JSONL events — without modifying crewai internals.
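The interception technique can be sketched as a stdout proxy object; the regex pattern and event fields below are assumptions for illustration, and the real implementation lives in pipeline_logger.py:

```python
import json
import re
import sys
import time

# Pattern is an assumption; crewai's actual verbose format may differ.
TOOL_CALL = re.compile(r"Using tool:\s*(?P<tool>\S+)")

class PipelineLogger:
    """Tee sys.stdout, emitting a JSONL event for each matched line."""

    def __init__(self, log_path: str):
        self.log = open(log_path, "a")
        self.stdout = sys.stdout

    def write(self, text: str):
        self.stdout.write(text)  # pass through untouched
        m = TOOL_CALL.search(text)
        if m:
            event = {"ts": time.time(), "event": "tool_call", "tool": m.group("tool")}
            self.log.write(json.dumps(event) + "\n")
            self.log.flush()

    def flush(self):
        self.stdout.flush()

# Install with: sys.stdout = PipelineLogger("pipeline_log.jsonl")
```

Because the proxy only reads the text stream, crewai's own code runs unmodified; restoring sys.stdout afterwards undoes the interception.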
