Evaluates multimodal financial reasoning across three input conditions
(text_only, image_only, image_title_text) and four scoring dimensions
(functional, robustness, style, trajectory).
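As a sketch of how the sample grid is formed (the counts below mirror the 10 ground-truth tasks × 3 input conditions = 30 samples used elsewhere in this README; the task IDs are illustrative):

```python
from itertools import product

# Illustrative: the 10 tasks that have ground truth, crossed with the
# three input conditions named above.
task_ids = [f"task_{i:03d}" for i in range(1, 11)]
variants = ["text_only", "image_only", "image_title_text"]

# One evaluation sample per (task, variant) pair.
samples = [{"task_id": t, "variant": v} for t, v in product(task_ids, variants)]
print(len(samples))  # 30
```

Each sample is then scored along the four dimensions (functional, robustness, style, trajectory) described in the scoring section below.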
cd /Users/saumyachauhan/Projects/fin-eval-harness
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"cp -r /Users/saumyachauhan/Downloads/eval_harness_tasks_25_with_artifacts_and_rubrics/* \
data/task_packages/Create a .env file (or export directly):
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..." # or GOOGLE_API_KEY (Gemini API / Vertex)
export OPENROUTER_API_KEY="..." # optional; needed for openrouter/openai/gpt-5.4

# Single model
python -m scripts.run_eval \
--task-dir data/task_packages \
--model anthropic/claude-sonnet-4-6
# Multiple models in one run
python -m scripts.run_eval \
--model anthropic/claude-sonnet-4-6 \
--model openai/gpt-4o \
--model gemini/gemini-2.5-pro
# All default models (see benchmark/model_config.py)
python -m scripts.run_eval
# Specific tasks and variants
python -m scripts.run_eval \
--model anthropic/claude-sonnet-4-6 \
--tasks task_001 task_002 task_003 \
--variants text_only image_only
# Dry run (no API calls — validates data loading and prompt building)
python -m scripts.run_eval --dry-run

Results land in results/run_<timestamp>_<model>/:
results/run_20260308_143000_anthropic_claude-sonnet-4-20250514/
├── raw_outputs/ # verbatim model responses
├── parsed_outputs/ # structured JSON extractions
├── sample_scores/ # per-sample section + composite scores
├── aggregate_summary.json
└── failure_analysis/ # (after export_slices)
python -m scripts.aggregate --run-dir results/run_20260308_143000_anthropic_claude-sonnet-4-20250514

python -m scripts.export_slices \
  --run-dir results/run_20260308_143000_anthropic_claude-sonnet-4-20250514 \
  --task-dir data/task_packages

Modal fans out sample evaluation across parallel cloud workers. Prompts (including base64-encoded images) are built locally and shipped to workers as self-contained payloads, so no shared filesystem is needed for input data.
pip install modal
modal token new # authenticates your CLI with Modal

Create secrets in the Modal dashboard (https://modal.com/secrets) or via CLI:
modal secret create anthropic-secret ANTHROPIC_API_KEY=sk-ant-...
modal secret create openai-secret OPENAI_API_KEY=sk-...
modal secret create gemini-secret GEMINI_API_KEY=...
modal secret create openrouter-secret OPENROUTER_API_KEY=... # optional, for GPT-5.4 via OpenRouter

# Single model: all 30 samples fan out in parallel
modal run runtime/modal_app.py --model anthropic/claude-sonnet-4-6
# All default models (runs sequentially per model, parallel within)
modal run runtime/modal_app.py
# Multiple models (comma-separated)
modal run runtime/modal_app.py --model "anthropic/claude-sonnet-4-6,openrouter/openai/gpt-5.4"
# Specific tasks
modal run runtime/modal_app.py --model openai/gpt-4o --tasks "task_001,task_002,task_003"
# Specific variants
modal run runtime/modal_app.py --model openai/gpt-4o --variants "text_only,image_only"
# Control concurrency (default 10, increase for bigger batches)
modal run runtime/modal_app.py --concurrency 30

Local machine                        Modal cloud
─────────────                        ───────────
Load 25 task packages           ┌──→ Worker 1: call model, parse, score sample 1
Filter to 10 with GT            ├──→ Worker 2: call model, parse, score sample 2
Flatten → 30 samples            ├──→ Worker 3: call model, parse, score sample 3
Build prompts (base64 imgs)     ├──→ ...
Fan out via .map() ─────────────┤
                                └──→ Worker N: call model, parse, score sample 30
Collect results ◄───────────────
Robustness scoring (grouped)         Results also persisted to the
Write local results/                 Modal Volume (fin-eval-results)
Print summary
Results are written both locally (to results/) and to a Modal Volume
(fin-eval-results) that persists across runs.
Runs the full Inspect eval framework inside Modal workers — one worker per model, all running simultaneously. Task packages are baked into the image.
# All default models in parallel (one Modal worker each)
modal run runtime/modal_inspect.py
# Single model
modal run runtime/modal_inspect.py --model anthropic/claude-sonnet-4-6
# Multiple models
modal run runtime/modal_inspect.py --model "anthropic/claude-sonnet-4-6,openrouter/openai/gpt-5.4"Inspect logs are persisted to the Modal Volume at
fin-eval-results/inspect_<timestamp>_<model>/inspect_logs/.
If you prefer the Inspect framework:
pip install inspect-ai
# Set task package location
export FIN_EVAL_TASK_DIR=data/task_packages
# Run via wrapper (recommended): log filenames include the model name
python -m scripts.run_inspect_eval --model anthropic/claude-sonnet-4-6
# Or run Inspect CLI directly (log filenames won't include model unless you set the env below)
inspect eval benchmark/task.py --model anthropic/claude-sonnet-4-6

Note: Inspect uses provider name google for Gemini (not gemini). Use google/gemini-2.5-pro or google/gemini-3.1-pro-preview with Inspect; use gemini/... only with run_eval (litellm). For Gemini, set only one of GOOGLE_API_KEY or GEMINI_API_KEY; if both are set, the google-genai library logs a warning on every request. For GPT-5.4, use openrouter/openai/gpt-5.4 with OPENROUTER_API_KEY set.
Log filenames with model: The wrapper sets INSPECT_EVAL_LOG_FILE_PATTERN so each .eval file is named like {timestamp}_financial-eval_{model}_{run_id}.eval (e.g. ..._anthropic-claude-sonnet-4-20250514_7QvaUVVp....eval). To get the same when calling inspect eval directly, set:
export INSPECT_EVAL_LOG_FILE_PATTERN="{task}_{model}_{id}"
inspect eval benchmark/task.py --model anthropic/claude-sonnet-4-6

When new expert answers arrive (as expert_answer_template.json files inside
each task folder):
# Preview what would change
python -m scripts.merge_expert_answers --task-dir data/task_packages --dry-run
# Actually merge
python -m scripts.merge_expert_answers --task-dir data/task_packages

# Build
docker build -t fin-eval-harness -f runtime/docker/Dockerfile .
# Run
docker run \
-v $(pwd)/data/task_packages:/app/data/task_packages \
-v $(pwd)/results:/app/results \
-e ANTHROPIC_API_KEY \
-e OPENAI_API_KEY \
-e GEMINI_API_KEY \
fin-eval-harness \
--model anthropic/claude-sonnet-4-20250514

Default models (March 2026) are defined in benchmark/model_config.py. You can override with --model for any run.
| Provider | Model string (Inspect / Modal) | API key / notes |
|---|---|---|
| Anthropic | anthropic/claude-sonnet-4-6 | ANTHROPIC_API_KEY |
| Anthropic | anthropic/claude-opus-4-6 | ANTHROPIC_API_KEY |
| OpenAI | openai/gpt-4o, openai/gpt-5.4 | OPENAI_API_KEY |
| OpenRouter | openrouter/openai/gpt-5.4 | OPENROUTER_API_KEY |
| Google | google/gemini-2.5-pro, google/gemini-3.1-pro-preview | GOOGLE_API_KEY or GEMINI_API_KEY |
- Inspect uses `google/` for Gemini (not `gemini/`). Use `google/gemini-3.1-pro-preview` for Gemini 3.1 Pro (Gemini API or Vertex).
- GPT-5.4: use `openrouter/openai/gpt-5.4` with OpenRouter, or `openai/gpt-5.4` with the OpenAI API.
- Any model supported by litellm can be used for `run_eval` / Modal litellm mode.
Before running evals, confirm which keys are set and which models are reachable:
# Which API keys are set (no key values printed)
python -m scripts.check_api_keys
# Probe each default/latest model with a minimal completion (confirms they work here)
python -m scripts.check_model_availability

See scripts/check_model_availability.py for options (e.g. --models to test specific model IDs).
| Section | Weight |
|---|---|
| Functional | 0.50 |
| Trajectory | 0.25 |
| Robustness | 0.15 |
| Style | 0.10 |
Criterion weights: major = 2, minor = 1.
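One way these weights could combine into a composite, sketched under assumptions (each criterion is pass/fail, a section score is the weighted fraction of criteria passed, and the composite is the weighted sum of section scores from the table above; the exact formulas live in the scoring code):

```python
# Section weights from the table above; criterion weights as stated (major = 2, minor = 1).
SECTION_WEIGHTS = {"functional": 0.50, "trajectory": 0.25, "robustness": 0.15, "style": 0.10}
CRITERION_WEIGHTS = {"major": 2, "minor": 1}

def section_score(criteria):
    """criteria: list of (severity, passed) pairs. Returns weighted pass fraction in [0, 1]."""
    total = sum(CRITERION_WEIGHTS[sev] for sev, _ in criteria)
    earned = sum(CRITERION_WEIGHTS[sev] for sev, passed in criteria if passed)
    return earned / total if total else 0.0

def composite(sections):
    """sections: dict mapping section name to its score in [0, 1]."""
    return sum(SECTION_WEIGHTS[name] * score for name, score in sections.items())

# Two majors (one failed) and one passed minor → (2 + 1) / (2 + 2 + 1) = 0.6
func = section_score([("major", True), ("major", False), ("minor", True)])
score = composite({"functional": func, "trajectory": 1.0, "robustness": 1.0, "style": 0.5})
```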
For categorical (non-numeric) answers, correctness is normally judged by case-insensitive exact string match. You can instead use an LLM to judge whether the model’s answer is correct (same meaning as ground truth):
- Set `USE_LLM_GRADER_FOR_CATEGORICAL=1` when running eval or Inspect.
- Optionally set `LLM_GRADER_MODEL` (default: `claude-sonnet-4-20250514`, the same as the typical eval model).
How the grader is called: The grader uses the Anthropic API only (no LiteLLM). It needs ANTHROPIC_API_KEY. If you see 404 for the grader, unset LLM_GRADER_MODEL so the default is used, or set it to a valid model (e.g. claude-sonnet-4-20250514 or claude-haiku-4-5).
The grader receives the question, ground truth, and model answer and returns CORRECT or INCORRECT. This works in all run modes: local run_eval, Inspect (inspect eval), Modal, and Modal + Inspect. Set the env vars before running; Modal passes them into workers.
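The default (non-LLM) path can be sketched as a simple normalized comparison; this is an illustrative sketch, not the harness's actual implementation:

```python
def grade_categorical(model_answer: str, ground_truth: str) -> bool:
    """Default path: case-insensitive exact string match after trimming whitespace.
    The LLM-grader path described above would replace this comparison with a
    CORRECT/INCORRECT judgment from a grading model."""
    return model_answer.strip().lower() == ground_truth.strip().lower()

print(grade_categorical("  Buy ", "buy"))        # True
print(grade_categorical("strong buy", "buy"))    # False
```

Exact match is strict: semantically equivalent paraphrases fail it, which is precisely the case the LLM grader is meant to handle.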
After running Inspect evals for one or more models:
1. Extract `.eval` logs into `results.json` per run:

   python -m scripts.extract_inspect_results --log-dir logs/ --out-dir results/inspect_4model_runs

   - Input: `logs/*.eval` (Inspect eval logs).
   - Output: one folder per run under `results/inspect_4model_runs/`, each with `results.json` (and an optional `report.md`). Each `results.json` has `meta` (model, etc.) and `per_sample` (task_id, variant, answer_correct, functional_score, trajectory_score, composite_score, etc.).

2. Analyze by task type and variant (tables + narrative). A task-type mapping file is required (JSON: task_id → type, e.g. `percentage_change`, `margin`):

   python -m scripts.analyze_results_by_task_and_variant \
     --task-type-mapping /path/to/task_type_mapping.json \
     --results-dir results/inspect_4model_runs \
     --task-dir data/task_packages \
     --out results/inspect_4model_runs/ANALYSIS_REPORT.txt

   - Input: all `results/inspect_4model_runs/*/results.json`, with task types taken from `--task-type-mapping` only.
   - Output: `results/inspect_4model_runs/ANALYSIS_REPORT.txt` containing:
     - Tables: per task type, per variant, per model, task type × variant, variance across variants (per model and per task type), model × variant, task type × variant × model.
     - A short narrative interpreting variance and strengths by task type/variant.
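The per-variant grouping that the analysis script performs can be sketched like this (the records below are made-up illustrations in the `per_sample` shape described above):

```python
import json
from collections import defaultdict
from statistics import mean

# Illustrative results.json content in the shape described above.
results_json = json.dumps({
    "meta": {"model": "anthropic/claude-sonnet-4-6"},
    "per_sample": [
        {"task_id": "task_001", "variant": "text_only", "composite_score": 0.8},
        {"task_id": "task_001", "variant": "image_only", "composite_score": 0.6},
        {"task_id": "task_002", "variant": "text_only", "composite_score": 0.9},
    ],
})

# Group composite scores by variant and average them.
by_variant = defaultdict(list)
for rec in json.loads(results_json)["per_sample"]:
    by_variant[rec["variant"]].append(rec["composite_score"])

summary = {v: mean(scores) for v, scores in by_variant.items()}
print(summary)
```

The real script extends the same idea to group by task type and by model, and to compute variance across variants.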