Evaluates multimodal financial reasoning across three input conditions
(text_only, image_only, image_title_text) and four scoring dimensions
(functional, robustness, style, trajectory).
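As a sketch of how the sample grid is formed (the counts below mirror the 10 ground-truth tasks × 3 input conditions = 30 samples used elsewhere in this README; the task IDs are illustrative):

```python
from itertools import product

# Illustrative: the 10 tasks that have ground truth, crossed with the
# three input conditions named above.
task_ids = [f"task_{i:03d}" for i in range(1, 11)]
variants = ["text_only", "image_only", "image_title_text"]

# One evaluation sample per (task, variant) pair.
samples = [{"task_id": t, "variant": v} for t, v in product(task_ids, variants)]
print(len(samples))  # 30
```

Each sample is then scored along the four dimensions (functional, robustness, style, trajectory) described in the scoring section below.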
cd /Users/saumyachauhan/Projects/fin-eval-harness
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"cp -r /Users/saumyachauhan/Downloads/eval_harness_tasks_25_with_artifacts_and_rubrics/* \
data/task_packages/Create a .env file (or export directly):
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..." # or GOOGLE_API_KEY (Gemini API / Vertex)
export OPENROUTER_API_KEY="..." # optional; needed for openrouter/openai/gpt-5.4

# Single model
python -m scripts.run_eval \
--task-dir data/task_packages \
--model anthropic/claude-sonnet-4-6
# Multiple models in one run
python -m scripts.run_eval \
--model anthropic/claude-sonnet-4-6 \
--model openai/gpt-4o \
--model gemini/gemini-2.5-pro
# All default models (see benchmark/model_config.py)
python -m scripts.run_eval
# Specific tasks and variants
python -m scripts.run_eval \
--model anthropic/claude-sonnet-4-6 \
--tasks task_001 task_002 task_003 \
--variants text_only image_only
# Dry run (no API calls — validates data loading and prompt building)
python -m scripts.run_eval --dry-run

Results land in results/run_<timestamp>_<model>/:
results/run_20260308_143000_anthropic_claude-sonnet-4-20250514/
├── raw_outputs/ # verbatim model responses
├── parsed_outputs/ # structured JSON extractions
├── sample_scores/ # per-sample section + composite scores
├── aggregate_summary.json
└── failure_analysis/ # (after export_slices)
python -m scripts.aggregate --run-dir results/run_20260308_143000_anthropic_claude-sonnet-4-20250514

python -m scripts.export_slices \
  --run-dir results/run_20260308_143000_anthropic_claude-sonnet-4-20250514 \
  --task-dir data/task_packages

Modal fans out sample evaluation across parallel cloud workers. Prompts (including base64-encoded images) are built locally and shipped to workers as self-contained payloads, so no shared filesystem is needed for input data.
pip install modal
modal token new # authenticates your CLI with Modal

Create secrets in the Modal dashboard (https://modal.com/secrets) or via CLI:
modal secret create anthropic-secret ANTHROPIC_API_KEY=sk-ant-...
modal secret create openai-secret OPENAI_API_KEY=sk-...
modal secret create gemini-secret GEMINI_API_KEY=...
modal secret create openrouter-secret OPENROUTER_API_KEY=... # optional, for GPT-5.4 via OpenRouter

# Single model: all 30 samples fan out in parallel
modal run runtime/modal_app.py --model anthropic/claude-sonnet-4-6
# All default models (runs sequentially per model, parallel within)
modal run runtime/modal_app.py
# Multiple models (comma-separated)
modal run runtime/modal_app.py --model "anthropic/claude-sonnet-4-6,openrouter/openai/gpt-5.4"
# Specific tasks
modal run runtime/modal_app.py --model openai/gpt-4o --tasks "task_001,task_002,task_003"
# Specific variants
modal run runtime/modal_app.py --model openai/gpt-4o --variants "text_only,image_only"
# Control concurrency (default 10, increase for bigger batches)
modal run runtime/modal_app.py --concurrency 30

Local machine                        Modal cloud
─────────────                        ───────────
Load 25 task packages           ┌──→ Worker 1: call model, parse, score sample 1
Filter to 10 with GT            ├──→ Worker 2: call model, parse, score sample 2
Flatten → 30 samples            ├──→ Worker 3: call model, parse, score sample 3
Build prompts (base64 imgs)     ├──→ ...
Fan out via .map() ─────────────┤
                                └──→ Worker N: call model, parse, score sample 30
Collect results ◄───────────────
Robustness scoring (grouped)         Results also persisted to the
Write local results/                 Modal Volume (fin-eval-results)
Print summary
Results are written both locally (to results/) and to a Modal Volume
(fin-eval-results) that persists across runs.
Runs the full Inspect eval framework inside Modal workers — one worker per model, all running simultaneously. Task packages are baked into the image.
# All default models in parallel (one Modal worker each)
modal run runtime/modal_inspect.py
# Single model
modal run runtime/modal_inspect.py --model anthropic/claude-sonnet-4-6
# Multiple models
modal run runtime/modal_inspect.py --model "anthropic/claude-sonnet-4-6,openrouter/openai/gpt-5.4"Inspect logs are persisted to the Modal Volume at
fin-eval-results/inspect_<timestamp>_<model>/inspect_logs/.
If you prefer the Inspect framework:
pip install inspect-ai
# Set task package location
export FIN_EVAL_TASK_DIR=data/task_packages
# Run via wrapper (recommended): log filenames include the model name
python -m scripts.run_inspect_eval --model anthropic/claude-sonnet-4-6
# Or run Inspect CLI directly (log filenames won't include model unless you set the env below)
inspect eval benchmark/task.py --model anthropic/claude-sonnet-4-6

Note: Inspect uses provider name google for Gemini (not gemini). Use google/gemini-2.5-pro or google/gemini-3.1-pro-preview with Inspect; use gemini/... only with run_eval (litellm). For Gemini, set only one of GOOGLE_API_KEY or GEMINI_API_KEY; if both are set, the google-genai library logs a warning on every request. For GPT-5.4, use openrouter/openai/gpt-5.4 with OPENROUTER_API_KEY set.
Log filenames with model: The wrapper sets INSPECT_EVAL_LOG_FILE_PATTERN so each .eval file is named like {timestamp}_financial-eval_{model}_{run_id}.eval (e.g. ..._anthropic-claude-sonnet-4-20250514_7QvaUVVp....eval). To get the same when calling inspect eval directly, set:
export INSPECT_EVAL_LOG_FILE_PATTERN="{task}_{model}_{id}"
inspect eval benchmark/task.py --model anthropic/claude-sonnet-4-6

When new expert answers arrive (as expert_answer_template.json files inside
each task folder):
# Preview what would change
python -m scripts.merge_expert_answers --task-dir data/task_packages --dry-run
# Actually merge
python -m scripts.merge_expert_answers --task-dir data/task_packages

# Build
docker build -t fin-eval-harness -f runtime/docker/Dockerfile .
# Run
docker run \
-v $(pwd)/data/task_packages:/app/data/task_packages \
-v $(pwd)/results:/app/results \
-e ANTHROPIC_API_KEY \
-e OPENAI_API_KEY \
-e GEMINI_API_KEY \
fin-eval-harness \
--model anthropic/claude-sonnet-4-20250514

Default models (March 2026) are defined in benchmark/model_config.py. You can override with --model for any run.
| Provider | Model string (Inspect / Modal) | API key / notes |
|---|---|---|
| Anthropic | anthropic/claude-sonnet-4-6 | ANTHROPIC_API_KEY |
| Anthropic | anthropic/claude-opus-4-6 | ANTHROPIC_API_KEY |
| OpenAI | openai/gpt-4o, openai/gpt-5.4 | OPENAI_API_KEY |
| OpenRouter | openrouter/openai/gpt-5.4 | OPENROUTER_API_KEY |
| Google | google/gemini-2.5-pro, google/gemini-3.1-pro-preview | GOOGLE_API_KEY or GEMINI_API_KEY |
- Inspect uses `google/` for Gemini (not `gemini/`). Use `google/gemini-3.1-pro-preview` for Gemini 3.1 Pro (Gemini API or Vertex).
- GPT-5.4: use `openrouter/openai/gpt-5.4` with OpenRouter, or `openai/gpt-5.4` with the OpenAI API.
- Any model supported by litellm can be used for `run_eval` / Modal litellm mode.
Before running evals, confirm which keys are set and which models are reachable:
# Which API keys are set (no key values printed)
python -m scripts.check_api_keys
# Probe each default/latest model with a minimal completion (confirms they work here)
python -m scripts.check_model_availability

See scripts/check_model_availability.py for options (e.g. --models to test specific model IDs).
| Section | Weight |
|---|---|
| Functional | 0.50 |
| Trajectory | 0.25 |
| Robustness | 0.15 |
| Style | 0.10 |
Criterion weights: major = 2, minor = 1.
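One way these weights could combine into a composite, sketched under assumptions (each criterion is pass/fail, a section score is the weighted fraction of criteria passed, and the composite is the weighted sum of section scores from the table above; the exact formulas live in the scoring code):

```python
# Section weights from the table above; criterion weights as stated (major = 2, minor = 1).
SECTION_WEIGHTS = {"functional": 0.50, "trajectory": 0.25, "robustness": 0.15, "style": 0.10}
CRITERION_WEIGHTS = {"major": 2, "minor": 1}

def section_score(criteria):
    """criteria: list of (severity, passed) pairs. Returns weighted pass fraction in [0, 1]."""
    total = sum(CRITERION_WEIGHTS[sev] for sev, _ in criteria)
    earned = sum(CRITERION_WEIGHTS[sev] for sev, passed in criteria if passed)
    return earned / total if total else 0.0

def composite(sections):
    """sections: dict mapping section name to its score in [0, 1]."""
    return sum(SECTION_WEIGHTS[name] * score for name, score in sections.items())

# Two majors (one failed) and one passed minor → (2 + 1) / (2 + 2 + 1) = 0.6
func = section_score([("major", True), ("major", False), ("minor", True)])
score = composite({"functional": func, "trajectory": 1.0, "robustness": 1.0, "style": 0.5})
```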
For categorical (non-numeric) answers, correctness is normally judged by case-insensitive exact string match. You can instead use an LLM to judge whether the model’s answer is correct (same meaning as ground truth):
- Set `USE_LLM_GRADER_FOR_CATEGORICAL=1` when running eval or Inspect.
- Optionally set `LLM_GRADER_MODEL` (default: `claude-sonnet-4-20250514`, the same as the typical eval model).
How the grader is called: The grader uses the Anthropic API only (no LiteLLM). It needs ANTHROPIC_API_KEY. If you see 404 for the grader, unset LLM_GRADER_MODEL so the default is used, or set it to a valid model (e.g. claude-sonnet-4-20250514 or claude-haiku-4-5).
The grader receives the question, ground truth, and model answer and returns CORRECT or INCORRECT. This works in all run modes: local run_eval, Inspect (inspect eval), Modal, and Modal + Inspect. Set the env vars before running; Modal passes them into workers.
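The default (non-LLM) path can be sketched as a simple normalized comparison; this is an illustrative sketch, not the harness's actual implementation:

```python
def grade_categorical(model_answer: str, ground_truth: str) -> bool:
    """Default path: case-insensitive exact string match after trimming whitespace.
    The LLM-grader path described above would replace this comparison with a
    CORRECT/INCORRECT judgment from a grading model."""
    return model_answer.strip().lower() == ground_truth.strip().lower()

print(grade_categorical("  Buy ", "buy"))        # True
print(grade_categorical("strong buy", "buy"))    # False
```

Exact match is strict: semantically equivalent paraphrases fail it, which is precisely the case the LLM grader is meant to handle.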
After running Inspect evals for one or more models:
1. Extract `.eval` logs into `results.json` per run:

   python -m scripts.extract_inspect_results --log-dir logs/ --out-dir results/inspect_4model_runs

   - Input: `logs/*.eval` (Inspect eval logs).
   - Output: one folder per run under `results/inspect_4model_runs/`, each with `results.json` (and an optional `report.md`). Each `results.json` has `meta` (model, etc.) and `per_sample` (task_id, variant, answer_correct, functional_score, trajectory_score, composite_score, etc.).

2. Analyze by task type and variant (tables + narrative). A task-type mapping file is required (JSON: task_id → type, e.g. `percentage_change`, `margin`):

   python -m scripts.analyze_results_by_task_and_variant \
     --task-type-mapping /path/to/task_type_mapping.json \
     --results-dir results/inspect_4model_runs \
     --task-dir data/task_packages \
     --out results/inspect_4model_runs/ANALYSIS_REPORT.txt

   - Input: all `results/inspect_4model_runs/*/results.json`, with task types taken from `--task-type-mapping` only.
   - Output: `results/inspect_4model_runs/ANALYSIS_REPORT.txt` containing:
     - Tables: per task type, per variant, per model, task type × variant, variance across variants (per model and per task type), model × variant, task type × variant × model.
     - A short narrative interpreting variance and strengths by task type/variant.
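The per-variant grouping that the analysis script performs can be sketched like this (the records below are made-up illustrations in the `per_sample` shape described above):

```python
import json
from collections import defaultdict
from statistics import mean

# Illustrative results.json content in the shape described above.
results_json = json.dumps({
    "meta": {"model": "anthropic/claude-sonnet-4-6"},
    "per_sample": [
        {"task_id": "task_001", "variant": "text_only", "composite_score": 0.8},
        {"task_id": "task_001", "variant": "image_only", "composite_score": 0.6},
        {"task_id": "task_002", "variant": "text_only", "composite_score": 0.9},
    ],
})

# Group composite scores by variant and average them.
by_variant = defaultdict(list)
for rec in json.loads(results_json)["per_sample"]:
    by_variant[rec["variant"]].append(rec["composite_score"])

summary = {v: mean(scores) for v, scores in by_variant.items()}
print(summary)
```

The real script extends the same idea to group by task type and by model, and to compute variance across variants.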