Mercor-Intelligence/fin-eval-harness

Financial Visual Reasoning Eval Harness

Evaluates multimodal financial reasoning across three input conditions (text_only, image_only, image_title_text) and four scoring dimensions (functional, robustness, style, trajectory).
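The evaluation grid is the cross product of those input conditions and scoring dimensions. A minimal sketch (names taken directly from the list above):

```python
from itertools import product

# Input conditions and scoring dimensions, as listed above.
VARIANTS = ["text_only", "image_only", "image_title_text"]
DIMENSIONS = ["functional", "robustness", "style", "trajectory"]

# Every (variant, dimension) cell the harness scores: 3 x 4 = 12.
grid = list(product(VARIANTS, DIMENSIONS))
```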


Quick start

1. Clone and install

git clone https://github.com/Mercor-Intelligence/fin-eval-harness.git
cd fin-eval-harness
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"

2. Copy task packages

cp -r /path/to/eval_harness_tasks_25_with_artifacts_and_rubrics/* \
      data/task_packages/

3. Set API keys

Create a .env file (or export directly):

export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."          # or GOOGLE_API_KEY (Gemini API / Vertex)
export OPENROUTER_API_KEY="..."      # optional; needed for openrouter/openai/gpt-5.4
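If you keep the keys in a .env file, a minimal stdlib loader looks like the sketch below (tools such as python-dotenv do the same thing more robustly; whether the harness loads .env automatically is not specified here):

```python
import os

def load_env_file(path=".env"):
    """Minimal .env loader: handles `KEY=value` and `export KEY="value"` lines."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith("export "):
                line = line[len("export "):]
            key, _, value = line.partition("=")
            # Variables already set in the environment win over the file.
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```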

4. Run the eval

# Single model
python -m scripts.run_eval \
    --task-dir data/task_packages \
    --model anthropic/claude-sonnet-4-6

# Multiple models in one run
python -m scripts.run_eval \
    --model anthropic/claude-sonnet-4-6 \
    --model openai/gpt-4o \
    --model gemini/gemini-2.5-pro

# All default models (see benchmark/model_config.py)
python -m scripts.run_eval

# Specific tasks and variants
python -m scripts.run_eval \
    --model anthropic/claude-sonnet-4-6 \
    --tasks task_001 task_002 task_003 \
    --variants text_only image_only

# Dry run (no API calls — validates data loading and prompt building)
python -m scripts.run_eval --dry-run

Outputs

Results land in results/run_<timestamp>_<model>/:

results/run_20260308_143000_anthropic_claude-sonnet-4-20250514/
├── raw_outputs/          # verbatim model responses
├── parsed_outputs/       # structured JSON extractions
├── sample_scores/        # per-sample section + composite scores
├── aggregate_summary.json
└── failure_analysis/     # (after export_slices)

Post-run analysis

Aggregate summary

python -m scripts.aggregate --run-dir results/run_20260308_143000_anthropic_claude-sonnet-4-20250514

Failure slices

python -m scripts.export_slices \
    --run-dir results/run_20260308_143000_anthropic_claude-sonnet-4-20250514 \
    --task-dir data/task_packages

Modal (parallel execution at scale)

Modal fans out sample evaluation across parallel cloud workers. Prompts (including base64-encoded images) are built locally and shipped to workers as self-contained payloads — no shared filesystem needed for input data.
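A self-contained sample payload might look like the sketch below. The field names are illustrative, not the harness's actual schema; the point is that images travel inline as base64, so workers need no shared filesystem:

```python
import base64
import json
from typing import Optional

def build_payload(sample_id: str, prompt: str, image_bytes: Optional[bytes] = None) -> str:
    """Bundle one sample into a self-contained, JSON-serializable payload."""
    payload = {
        "sample_id": sample_id,
        "prompt": prompt,
        # Inline the image as base64 so the payload carries everything a worker needs.
        "image_b64": base64.b64encode(image_bytes).decode("ascii") if image_bytes else None,
    }
    return json.dumps(payload)
```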

One-time setup

pip install modal
modal token new                    # authenticates your CLI with Modal

Create secrets in the Modal dashboard (https://modal.com/secrets) or via CLI:

modal secret create anthropic-secret ANTHROPIC_API_KEY=sk-ant-...
modal secret create openai-secret    OPENAI_API_KEY=sk-...
modal secret create gemini-secret    GEMINI_API_KEY=...
modal secret create openrouter-secret OPENROUTER_API_KEY=...   # optional, for GPT-5.4 via OpenRouter

Mode A: Modal + litellm (parallel across samples)

# Single model — all 30 samples fan out in parallel
modal run runtime/modal_app.py --model anthropic/claude-sonnet-4-6

# All default models (runs sequentially per model, parallel within)
modal run runtime/modal_app.py

# Multiple models (comma-separated)
modal run runtime/modal_app.py --model "anthropic/claude-sonnet-4-6,openrouter/openai/gpt-5.4"

# Specific tasks
modal run runtime/modal_app.py --model openai/gpt-4o --tasks "task_001,task_002,task_003"

# Specific variants
modal run runtime/modal_app.py --model openai/gpt-4o --variants "text_only,image_only"

# Control concurrency (default 10, increase for bigger batches)
modal run runtime/modal_app.py --concurrency 30

How it works

Local machine                         Modal cloud
─────────────                         ───────────
Load 25 task packages          ┌──→  Worker 1: call model, parse, score sample 1
Filter to 10 with GT           ├──→  Worker 2: call model, parse, score sample 2
Flatten → 30 samples           ├──→  Worker 3: call model, parse, score sample 3
Build prompts (base64 imgs)    ├──→  ...
Fan out via .map() ────────────┤
                               └──→  Worker N: call model, parse, score sample 30
Collect results  ◄─────────────────
Robustness scoring (grouped)        Results also persisted to
Write local results/                Modal Volume (fin-eval-results)
Print summary

Results are written both locally (to results/) and to a Modal Volume (fin-eval-results) that persists across runs.
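The fan-out/collect pattern above has the same shape as a local thread-pool map. A stripped-down analogue, with `executor.map` playing the role of Modal's `.map()` and a stand-in scoring function:

```python
from concurrent.futures import ThreadPoolExecutor

def score_sample(sample):
    # Stand-in for: call the model, parse the response, score one sample.
    return {"sample_id": sample["id"], "scored": True}

samples = [{"id": i, "prompt": f"question {i}"} for i in range(30)]

# Fan out across workers, then collect results back in order on the driver.
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(score_sample, samples))
```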

Mode B: Modal + Inspect (parallel across models)

Runs the full Inspect eval framework inside Modal workers — one worker per model, all running simultaneously. Task packages are baked into the image.

# All default models in parallel (one Modal worker each)
modal run runtime/modal_inspect.py

# Single model
modal run runtime/modal_inspect.py --model anthropic/claude-sonnet-4-6

# Multiple models
modal run runtime/modal_inspect.py --model "anthropic/claude-sonnet-4-6,openrouter/openai/gpt-5.4"

Inspect logs are persisted to the Modal Volume at fin-eval-results/inspect_<timestamp>_<model>/inspect_logs/.


Inspect AI (optional, local)

If you prefer the Inspect framework:

pip install inspect-ai

# Set task package location
export FIN_EVAL_TASK_DIR=data/task_packages

# Run via wrapper (recommended): log filenames include the model name
python -m scripts.run_inspect_eval --model anthropic/claude-sonnet-4-6

# Or run Inspect CLI directly (log filenames won't include model unless you set the env below)
inspect eval benchmark/task.py --model anthropic/claude-sonnet-4-6

Note:
  • Inspect uses provider name google for Gemini (not gemini). Use google/gemini-2.5-pro or google/gemini-3.1-pro-preview with Inspect; use gemini/... only with run_eval (litellm).
  • For Gemini, set only one of GOOGLE_API_KEY or GEMINI_API_KEY; if both are set, the google-genai library logs a warning on every request.
  • For GPT-5.4, use openrouter/openai/gpt-5.4 with OPENROUTER_API_KEY set.

Log filenames with model: The wrapper sets INSPECT_EVAL_LOG_FILE_PATTERN so each .eval file is named like {timestamp}_financial-eval_{model}_{run_id}.eval (e.g. ..._anthropic-claude-sonnet-4-20250514_7QvaUVVp....eval). To get the same when calling inspect eval directly, set:

export INSPECT_EVAL_LOG_FILE_PATTERN="{task}_{model}_{id}"
inspect eval benchmark/task.py --model anthropic/claude-sonnet-4-6

Expert annotation merge

When new expert answers arrive (as expert_answer_template.json files inside each task folder):

# Preview what would change
python -m scripts.merge_expert_answers --task-dir data/task_packages --dry-run

# Actually merge
python -m scripts.merge_expert_answers --task-dir data/task_packages
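Conceptually, the merge overlays each task's expert_answer_template.json onto that task's data. A simplified sketch; the task filename (`task.json`) and the overlay-by-update strategy are assumptions, not the script's actual behavior:

```python
import json
from pathlib import Path

def merge_expert_answers(task_dir, dry_run=False):
    """Overlay each task's expert answers onto its task file; return touched task folders."""
    merged = []
    for answer_file in Path(task_dir).glob("*/expert_answer_template.json"):
        expert = json.loads(answer_file.read_text())
        task_file = answer_file.parent / "task.json"  # hypothetical task filename
        task = json.loads(task_file.read_text())
        task.update(expert)  # expert answers take precedence
        merged.append(answer_file.parent.name)
        if not dry_run:
            task_file.write_text(json.dumps(task, indent=2))
    return merged
```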

Docker

# Build
docker build -t fin-eval-harness -f runtime/docker/Dockerfile .

# Run
docker run \
    -v $(pwd)/data/task_packages:/app/data/task_packages \
    -v $(pwd)/results:/app/results \
    -e ANTHROPIC_API_KEY \
    -e OPENAI_API_KEY \
    -e GEMINI_API_KEY \
    fin-eval-harness \
    --model anthropic/claude-sonnet-4-20250514

Supported models

Default models (March 2026) are defined in benchmark/model_config.py. You can override with --model for any run.

Provider     Model string (Inspect / Modal)                          API key / notes
Anthropic    anthropic/claude-sonnet-4-6                             ANTHROPIC_API_KEY
Anthropic    anthropic/claude-opus-4-6                               ANTHROPIC_API_KEY
OpenAI       openai/gpt-4o, openai/gpt-5.4                           OPENAI_API_KEY
OpenRouter   openrouter/openai/gpt-5.4                               OPENROUTER_API_KEY
Google       google/gemini-2.5-pro, google/gemini-3.1-pro-preview    GOOGLE_API_KEY or GEMINI_API_KEY
  • Inspect uses google/ for Gemini (not gemini/). Use google/gemini-3.1-pro-preview for Gemini 3.1 Pro (Gemini API or Vertex).
  • GPT-5.4: use openrouter/openai/gpt-5.4 with OpenRouter, or openai/gpt-5.4 with OpenAI API.
  • Any model supported by litellm can be used for run_eval / Modal litellm mode.
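The gemini/ vs google/ prefix difference between litellm and Inspect can be bridged with a tiny translation helper. A sketch, not part of the harness:

```python
def to_inspect_model(model: str) -> str:
    """Translate a litellm-style model string into Inspect's provider naming."""
    # Inspect calls the Gemini provider "google"; litellm calls it "gemini".
    if model.startswith("gemini/"):
        return "google/" + model[len("gemini/"):]
    return model
```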

Check API keys and model availability

Before running evals, confirm which keys are set and which models are reachable:

# Which API keys are set (no key values printed)
python -m scripts.check_api_keys

# Probe each default/latest model with a minimal completion (confirms they work here)
python -m scripts.check_model_availability

See scripts/check_model_availability.py for options (e.g. --models to test specific model IDs).
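Conceptually, the key check reduces to reporting which expected variables are unset, without printing any values. A sketch (the actual script may check more keys or do more):

```python
import os

REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY")

def missing_keys(required=REQUIRED_KEYS):
    """Report which required key variables are unset; never print the values themselves."""
    return [name for name in required if not os.environ.get(name)]
```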


Scoring weights

Section      Weight
Functional   0.50
Trajectory   0.25
Robustness   0.15
Style        0.10

Criterion weights: major = 2, minor = 1.
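Combining the two weight tables above, a composite is a weighted sum of section scores, and a section score is a weighted pass rate over its rubric criteria (major = 2, minor = 1). A sketch of that arithmetic; the harness's actual scoring code may differ in detail:

```python
SECTION_WEIGHTS = {"functional": 0.50, "trajectory": 0.25, "robustness": 0.15, "style": 0.10}
CRITERION_WEIGHTS = {"major": 2, "minor": 1}

def section_score(criteria):
    """criteria: list of (severity, passed) pairs -> weighted pass rate in [0, 1]."""
    total = sum(CRITERION_WEIGHTS[sev] for sev, _ in criteria)
    earned = sum(CRITERION_WEIGHTS[sev] for sev, passed in criteria if passed)
    return earned / total if total else 0.0

def composite_score(sections):
    """sections: dict of section name -> score in [0, 1] -> weighted composite."""
    return sum(SECTION_WEIGHTS[name] * score for name, score in sections.items())
```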

LLM grader for categorical answers

For categorical (non-numeric) answers, correctness is normally judged by case-insensitive exact string match. You can instead use an LLM to judge whether the model’s answer is correct (same meaning as ground truth):

  • Set USE_LLM_GRADER_FOR_CATEGORICAL=1 when running eval or Inspect.
  • Optionally set LLM_GRADER_MODEL (default: claude-sonnet-4-20250514, the same model typically used for the eval).

How the grader is called: the grader uses the Anthropic API only (no LiteLLM), so it needs ANTHROPIC_API_KEY. If the grader returns a 404 error, unset LLM_GRADER_MODEL so the default is used, or set it to a valid model (e.g. claude-sonnet-4-20250514 or claude-haiku-4-5).

The grader receives the question, ground truth, and model answer and returns CORRECT or INCORRECT. This works in all run modes: local run_eval, Inspect (inspect eval), Modal, and Modal + Inspect. Set the env vars before running; Modal passes them into workers.
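The decision logic reduces to two paths: case-insensitive exact match by default, and a CORRECT/INCORRECT verdict parse when the LLM grader is enabled. A sketch of both (the real grading prompt and API call live in the harness):

```python
def exact_match(answer, ground_truth):
    """Default categorical grading: case-insensitive exact string match."""
    return answer.strip().lower() == ground_truth.strip().lower()

def parse_verdict(grader_output):
    """Map the grader's CORRECT / INCORRECT verdict to a boolean."""
    verdict = grader_output.strip().upper()
    if verdict.startswith("INCORRECT"):
        return False
    if verdict.startswith("CORRECT"):
        return True
    raise ValueError(f"unrecognized grader verdict: {grader_output!r}")
```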


After evals: extract and analyze (Inspect)

After running Inspect evals for one or more models:

  1. Extract .eval logs into results.json per run:

    python -m scripts.extract_inspect_results --log-dir logs/ --out-dir results/inspect_4model_runs
    • Input: logs/*.eval (Inspect eval logs).
    • Output: One folder per run under results/inspect_4model_runs/, each with results.json (and optional report.md). Each results.json has meta (model, etc.) and per_sample (task_id, variant, answer_correct, functional_score, trajectory_score, composite_score, etc.).
  2. Analyze by task type and variant (tables + narrative). A task-type mapping file is required (JSON: task_id → type, e.g. percentage_change, margin):

    python -m scripts.analyze_results_by_task_and_variant \
      --task-type-mapping /path/to/task_type_mapping.json \
      --results-dir results/inspect_4model_runs \
      --task-dir data/task_packages \
      --out results/inspect_4model_runs/ANALYSIS_REPORT.txt
    • Input: All results/inspect_4model_runs/*/results.json and task types from --task-type-mapping only.
    • Output: results/inspect_4model_runs/ANALYSIS_REPORT.txt containing:
      • Tables: per task type, per variant, per model, task type × variant, variance across variants (per model and per task type), model × variant, task type × variant × model.
      • A short narrative interpreting variance and strengths by task type/variant.
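The grouping behind those report tables amounts to averaging composite scores per (task type, variant) cell. A sketch using the per_sample field names described above and the task-type mapping file:

```python
from collections import defaultdict

def mean_by_cell(per_sample, task_types):
    """Average composite_score per (task type, variant) cell.

    per_sample: rows with task_id / variant / composite_score (as in results.json).
    task_types: task_id -> type, as in the --task-type-mapping file.
    """
    cells = defaultdict(list)
    for row in per_sample:
        cells[(task_types[row["task_id"]], row["variant"])].append(row["composite_score"])
    return {cell: sum(scores) / len(scores) for cell, scores in cells.items()}
```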
