
jfinqa

Japanese Financial Numerical Reasoning QA Benchmark.


What is this?

jfinqa is a benchmark for evaluating LLMs on Japanese financial numerical reasoning. Unlike existing benchmarks that focus on classification or simple lookup, jfinqa requires multi-step arithmetic over financial statement tables extracted from real Japanese corporate disclosures (EDINET). Questions include DuPont decomposition (6-step), growth rate calculations, and cross-statement ratio analysis.

Three Subtasks

| Subtask | Description | Example |
| --- | --- | --- |
| Numerical Reasoning | Calculate financial metrics from table data | "2024年3月期の売上高成長率は何%か?" (What was the revenue growth rate for the fiscal year ended March 2024?) |
| Consistency Checking | Verify internal consistency of reported figures | "資産合計は流動資産と固定資産の合計と一致するか?" (Does total assets equal the sum of current and fixed assets?) |
| Temporal Reasoning | Analyze trends and changes across periods | "売上高が最も低かったのはどの年度か?" (In which fiscal year was revenue lowest?) |

Dataset Statistics

| | Total | Numerical Reasoning | Consistency Checking | Temporal Reasoning |
| --- | --- | --- | --- | --- |
| Questions | 1000 | 550 | 200 | 250 |
| Companies | 68 | | | |
| Accounting standards | J-GAAP 58%, IFRS 38%, US-GAAP 4% | | | |
| Avg. program steps | 2.59 | 2.84 | 2.00 | 2.54 |
| Avg. table rows | 13.3 | | | |
| Max program steps | 6 (DuPont) | | | |

Baseline Results

| Model | Overall | Numerical Reasoning | Consistency Checking | Temporal Reasoning |
| --- | --- | --- | --- | --- |
| GPT-4o | 87.0% | 80.2% | 90.5% | 99.2% |
| Gemini 2.0 Flash | 80.4% | 86.2% | 83.5% | 65.2% |
| GPT-4o-mini | 67.7% | 79.3% | 83.5% | 29.6% |
| Qwen2.5-3B-Instruct | 39.6% | 46.4% | 51.0% | 15.6% |

1000 questions, zero-shot, temperature=0. Evaluation uses numerical matching with a 1% tolerance. Qwen2.5-3B-Instruct was run locally with MLX (4-bit quantization).

View full leaderboard →
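As a sanity check, the Overall column is consistent with a question-count-weighted average of the three subtask scores (550 numerical reasoning, 200 consistency checking, 250 temporal reasoning questions). The `overall` helper below is illustrative, not part of the jfinqa API:

```python
# Question counts per subtask, from the dataset statistics table
WEIGHTS = {"nr": 550, "cc": 200, "tr": 250}

def overall(nr: float, cc: float, tr: float) -> float:
    """Weighted average of subtask accuracies by question count."""
    total = sum(WEIGHTS.values())
    return (nr * WEIGHTS["nr"] + cc * WEIGHTS["cc"] + tr * WEIGHTS["tr"]) / total

print(round(overall(80.2, 90.5, 99.2), 1))  # 87.0  (GPT-4o)
print(round(overall(86.2, 83.5, 65.2), 1))  # 80.4  (Gemini 2.0 Flash)
```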

Error Analysis

Systematic error analysis revealed both benchmark design issues and genuine LLM failure patterns.

Key findings:

  • Clear capability gradient: GPT-4o (87%) > Gemini 2.0 Flash (80%) > GPT-4o-mini (68%) >> Qwen2.5-3B (40%), confirming that the benchmark discriminates across model sizes and capabilities.
  • Temporal reasoning separates frontier models: GPT-4o achieves 99.2% on TR, while Gemini drops to 65.2% and GPT-4o-mini to 29.6%. This subtask requires strict output format compliance ("増収"/"減収", i.e. revenue increase/decrease, rather than "はい"/"いいえ", yes/no), which strongly differentiates models.
  • Gemini 2.0 Flash leads on numerical reasoning (86.2% vs GPT-4o's 80.2%), suggesting strong arithmetic capabilities, but falls behind on consistency checking and temporal reasoning where format compliance matters more.
  • DuPont decomposition questions are the hardest: the 56 six-step ROE decomposition questions see significant accuracy drops even for frontier models, and 3B-class models rarely solve them correctly.
  • GPT-4o-mini has a systematic prompt compliance issue in temporal reasoning. It answers "はい" (yes) to questions like "増収か減収か?" (did revenue increase or decrease?) despite correctly analyzing the direction in its reasoning chain (122 of 176 TR errors follow this pattern).
  • J-GAAP balance sheet structure is a major error source. Models confuse 純資産合計 (net assets) with 株主資本 (shareholders' equity), and decompose 総資産 into 4 sub-categories instead of the standard 2.
  • Qwen2.5-3B-Instruct struggles most with temporal reasoning (15.6%) and consistency checking (51.0%), suggesting that smaller models have difficulty with instruction-following and multi-step verification tasks in Japanese.
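To illustrate why the six-step DuPont questions are demanding, here is the decomposition written out in plain Python with made-up figures (the `#n` comments mirror the benchmark's program-step back-references; none of these numbers come from the dataset):

```python
# DuPont: ROE = net profit margin × asset turnover × equity multiplier
net_income, sales, total_assets, equity = 120.0, 1500.0, 2000.0, 800.0

margin = net_income / sales          # #0 net profit margin
turnover = sales / total_assets      # #1 asset turnover
leverage = total_assets / equity     # #2 equity multiplier
partial = margin * turnover          # #3
roe = partial * leverage             # #4 ROE as a fraction
roe_pct = roe * 100                  # #5 ROE in percent

print(f"{roe_pct:.1f}%")  # 15.0%  (equivalently, 120 / 800 * 100)
```

A model must carry intermediate results through all six steps without rounding or referencing the wrong line of the table, which is where smaller models tend to fail.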

Key Features

  • FinQA-compatible: Same data format as FinQA for cross-benchmark comparison
  • Japan-specific: Handles J-GAAP, IFRS, US-GAAP, and Japanese number formats (百万円, 億円, △)
  • Dual evaluation: Exact match and numerical match with tolerance
  • lm-evaluation-harness integration: Ready-to-use YAML task configs
  • Source provenance: Every question links back to its EDINET filing

Quick Start

Installation

pip install jfinqa
# or
uv add jfinqa

Evaluate Your Model

from jfinqa import load_dataset, evaluate

# Load benchmark questions
questions = load_dataset("numerical_reasoning")

# Provide predictions
predictions = {"nr_001": "25.0%", "nr_002": "16.0%"}
result = evaluate(questions, predictions=predictions)
print(result.summary())

Or Use a Model Function

from jfinqa import load_dataset, evaluate

questions = load_dataset()

def my_model(question: str, context: str) -> str:
    # Your model inference here
    return "42.5%"

result = evaluate(questions, model_fn=my_model)
print(result.summary())

CLI

# Inspect dataset questions
jfinqa inspect -s numerical_reasoning -n 5

# Evaluate predictions file
jfinqa evaluate -p predictions.json

# Evaluate with local data
jfinqa evaluate -p predictions.json -d local_data.json -s numerical_reasoning

lm-evaluation-harness

PR #3570 against lm-evaluation-harness is pending. Once it is merged:

lm-eval run --model openai-completions \
    --model_args model=gpt-4o \
    --tasks jfinqa \
    --num_fewshot 0

Before merge, use --include_path:

lm-eval run --model openai-completions \
    --model_args model=gpt-4o \
    --tasks jfinqa \
    --num_fewshot 0 \
    --include_path lm_eval_tasks/

Data Format

Each question follows the FinQA schema with additional metadata:

{
  "id": "nr_001",
  "subtask": "numerical_reasoning",
  "pre_text": ["以下はA社の連結損益計算書の抜粋である。"],
  "post_text": ["当期は前期比で増収増益となった。"],
  "table": {
    "headers": ["", "2024年3月期", "2023年3月期"],
    "rows": [
      ["売上高", "1,500,000", "1,200,000"],
      ["営業利益", "200,000", "150,000"]
    ]
  },
  "qa": {
    "question": "2024年3月期の売上高成長率は何%か?",
    "program": ["subtract(1500000, 1200000)", "divide(#0, 1200000)", "multiply(#1, 100)"],
    "answer": "25.0%",
    "gold_evidence": [0]
  },
  "edinet_code": "E00001",
  "filing_year": "2024",
  "accounting_standard": "J-GAAP"
}
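The `program` field uses the FinQA operation DSL, where `#n` refers to the result of step n. A minimal interpreter for it can be sketched as follows (illustrative only; `run_program` is not part of the jfinqa API, and real programs may use more operations than shown):

```python
import re

OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def run_program(program: list[str]) -> float:
    """Execute steps like 'subtract(1500000, 1200000)'; '#n' references step n."""
    results: list[float] = []
    for step in program:
        op, args = re.fullmatch(r"(\w+)\((.*)\)", step).groups()
        vals = [
            results[int(a[1:])] if a.startswith("#") else float(a)
            for a in (arg.strip() for arg in args.split(","))
        ]
        results.append(OPS[op](*vals))
    return results[-1]

program = ["subtract(1500000, 1200000)", "divide(#0, 1200000)", "multiply(#1, 100)"]
print(run_program(program))  # 25.0
```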

Japanese Number Handling

jfinqa correctly normalizes Japanese financial number formats:

| Input | Extracted value | Notes |
| --- | --- | --- |
| △1,000 | -1,000 | Triangle negative marker |
| 12,345 | 12,345 | Fullwidth digits + comma removal |
| 24,956百万円 | 24,956 | Compound financial units treated as labels |
| 50億 | 5,000,000,000 | Bare kanji multiplier applied |
| 42.5% | 42.5 | Percentage |
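A rough sketch of how such normalization can work (illustrative; `normalize_number` is an assumed name, not the library's actual function):

```python
import re
import unicodedata

def normalize_number(text: str) -> float:
    s = unicodedata.normalize("NFKC", text)   # fullwidth digits/commas -> ASCII
    s = s.replace("△", "-").replace(",", "")  # negative marker, thousands sep
    m = re.match(r"(-?\d+\.?\d*)(億|百万円)?", s)
    value = float(m.group(1))
    if m.group(2) == "億":
        value *= 100_000_000  # bare 億 is a 10^8 multiplier
    # 百万円 is treated as a unit label on already-scaled figures: no multiplier
    return value

print(normalize_number("△1,000"))       # -1000.0
print(normalize_number("50億"))         # 5000000000.0
print(normalize_number("24,956百万円"))  # 24956.0
```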

Development

git clone https://github.com/ajtgjmdjp/jfinqa
cd jfinqa
uv sync --dev --extra dev
uv run pytest -v
uv run ruff check .
uv run mypy src/

Data Attribution

Source financial data is obtained from EDINET (Electronic Disclosure for Investors' NETwork), operated by the Financial Services Agency of Japan (金融庁). EDINET data is provided under the Public Data License 1.0.

The data format is compatible with FinQA (Chen et al., 2021).

Related Projects

  • FinQA — English financial QA benchmark (Chen et al., 2021)
  • TAT-QA — Tabular and textual QA
  • edinet-mcp — EDINET XBRL parser (companion project)
  • EDINET-Bench — Sakana AI's financial classification benchmark

Citation

If you use jfinqa in your research, please cite it as follows:

@dataset{jfinqa2025,
  title={jfinqa: Japanese Financial Numerical Reasoning QA Benchmark},
  author={ajtgjmdjp},
  year={2025},
  url={https://github.com/ajtgjmdjp/jfinqa},
  license={Apache-2.0}
}

License

Apache-2.0. See NOTICE for third-party attributions.
