
jfinqa

Japanese Financial Numerical Reasoning QA Benchmark.


What is this?

jfinqa is a benchmark for evaluating LLMs on Japanese financial numerical reasoning. Unlike existing benchmarks that focus on classification or simple lookup, jfinqa requires multi-step arithmetic over financial statement tables extracted from real Japanese corporate disclosures (EDINET). Questions include DuPont decomposition (6-step), growth rate calculations, and cross-statement ratio analysis.

Three Subtasks

| Subtask | Description | Example |
| --- | --- | --- |
| Numerical Reasoning | Calculate financial metrics from table data | "2024年3月期の売上高成長率は何%か?" (What was the revenue growth rate for the fiscal year ended March 2024?) |
| Consistency Checking | Verify internal consistency of reported figures | "資産合計は流動資産と固定資産の合計と一致するか?" (Does total assets equal the sum of current and fixed assets?) |
| Temporal Reasoning | Analyze trends and changes across periods | "売上高が最も低かったのはどの年度か?" (In which fiscal year was revenue lowest?) |

Dataset Statistics

| | Total | Numerical Reasoning | Consistency Checking | Temporal Reasoning |
| --- | --- | --- | --- | --- |
| Questions | 1000 | 550 | 200 | 250 |
| Companies | 68 | | | |
| Accounting standards | J-GAAP 58%, IFRS 38%, US-GAAP 4% | | | |
| Avg. program steps | 2.59 | 2.84 | 2.00 | 2.54 |
| Avg. table rows | 13.3 | | | |
| Max program steps | 6 (DuPont) | | | |

Baseline Results

| Model | Overall | Numerical Reasoning | Consistency Checking | Temporal Reasoning |
| --- | --- | --- | --- | --- |
| GPT-4o | 87.0% | 80.2% | 90.5% | 99.2% |
| Gemini 2.0 Flash | 80.4% | 86.2% | 83.5% | 65.2% |
| GPT-4o-mini | 67.7% | 79.3% | 83.5% | 29.6% |
| Qwen2.5-3B-Instruct | 39.6% | 46.4% | 51.0% | 15.6% |

1000 questions, zero-shot, temperature=0. Evaluation uses numerical matching with a 1% tolerance. Qwen2.5-3B-Instruct was run locally with MLX (4-bit quantization).

View full leaderboard →
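As a sanity check, the Overall column is consistent with a question-count-weighted average of the three subtask scores (550 numerical reasoning, 200 consistency checking, 250 temporal reasoning questions). The `overall` helper below is illustrative, not part of the jfinqa API:

```python
# Question counts per subtask, from the dataset statistics table
WEIGHTS = {"nr": 550, "cc": 200, "tr": 250}

def overall(nr: float, cc: float, tr: float) -> float:
    """Weighted average of subtask accuracies by question count."""
    total = sum(WEIGHTS.values())
    return (nr * WEIGHTS["nr"] + cc * WEIGHTS["cc"] + tr * WEIGHTS["tr"]) / total

print(round(overall(80.2, 90.5, 99.2), 1))  # 87.0  (GPT-4o)
print(round(overall(86.2, 83.5, 65.2), 1))  # 80.4  (Gemini 2.0 Flash)
```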

Error Analysis

Systematic error analysis revealed both benchmark design issues and genuine LLM failure patterns.

Key findings:

  • Clear capability gradient: GPT-4o (87%) > Gemini 2.0 Flash (80%) > GPT-4o-mini (68%) >> Qwen2.5-3B (40%), confirming that the benchmark discriminates across model sizes and capabilities.
  • Temporal reasoning separates frontier models: GPT-4o achieves 99.2% on TR, while Gemini drops to 65.2% and GPT-4o-mini to 29.6%. This subtask requires strict output format compliance ("増収"/"減収", i.e. revenue increase/decrease, rather than "はい"/"いいえ", yes/no), which strongly differentiates models.
  • Gemini 2.0 Flash leads on numerical reasoning (86.2% vs GPT-4o's 80.2%), suggesting strong arithmetic capabilities, but falls behind on consistency checking and temporal reasoning where format compliance matters more.
  • DuPont decomposition questions are the hardest: the 56 six-step ROE decomposition questions see significant accuracy drops even for frontier models, and 3B-class models rarely solve them correctly.
  • GPT-4o-mini has a systematic prompt compliance issue in temporal reasoning. It answers "はい" (yes) to questions like "増収か減収か?" (did revenue increase or decrease?) despite correctly analyzing the direction in its reasoning chain (122 of 176 TR errors follow this pattern).
  • J-GAAP balance sheet structure is a major error source. Models confuse 純資産合計 (net assets) with 株主資本 (shareholders' equity), and decompose 総資産 into 4 sub-categories instead of the standard 2.
  • Qwen2.5-3B-Instruct struggles most with temporal reasoning (15.6%) and consistency checking (51.0%), suggesting that smaller models have difficulty with instruction-following and multi-step verification tasks in Japanese.
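To illustrate why the six-step DuPont questions are demanding, here is the decomposition written out in plain Python with made-up figures (the `#n` comments mirror the benchmark's program-step back-references; none of these numbers come from the dataset):

```python
# DuPont: ROE = net profit margin × asset turnover × equity multiplier
net_income, sales, total_assets, equity = 120.0, 1500.0, 2000.0, 800.0

margin = net_income / sales          # #0 net profit margin
turnover = sales / total_assets      # #1 asset turnover
leverage = total_assets / equity     # #2 equity multiplier
partial = margin * turnover          # #3
roe = partial * leverage             # #4 ROE as a fraction
roe_pct = roe * 100                  # #5 ROE in percent

print(f"{roe_pct:.1f}%")  # 15.0%  (equivalently, 120 / 800 * 100)
```

A model must carry intermediate results through all six steps without rounding or referencing the wrong line of the table, which is where smaller models tend to fail.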

Key Features

  • FinQA-compatible: Same data format as FinQA for cross-benchmark comparison
  • Japan-specific: Handles J-GAAP, IFRS, US-GAAP, and Japanese number formats (百万円, 億円, △)
  • Dual evaluation: Exact match and numerical match with tolerance
  • lm-evaluation-harness integration: Ready-to-use YAML task configs
  • Source provenance: Every question links back to its EDINET filing

Quick Start

Installation

pip install jfinqa
# or
uv add jfinqa

Evaluate Your Model

from jfinqa import load_dataset, evaluate

# Load benchmark questions
questions = load_dataset("numerical_reasoning")

# Provide predictions
predictions = {"nr_001": "25.0%", "nr_002": "16.0%"}
result = evaluate(questions, predictions=predictions)
print(result.summary())

Or Use a Model Function

from jfinqa import load_dataset, evaluate

questions = load_dataset()

def my_model(question: str, context: str) -> str:
    # Your model inference here
    return "42.5%"

result = evaluate(questions, model_fn=my_model)
print(result.summary())

CLI

# Inspect dataset questions
jfinqa inspect -s numerical_reasoning -n 5

# Evaluate predictions file
jfinqa evaluate -p predictions.json

# Evaluate with local data
jfinqa evaluate -p predictions.json -d local_data.json -s numerical_reasoning

lm-evaluation-harness

PR #3570 against lm-evaluation-harness is pending. Once it is merged:

lm-eval run --model openai-completions \
    --model_args model=gpt-4o \
    --tasks jfinqa \
    --num_fewshot 0

Before merge, use --include_path:

lm-eval run --model openai-completions \
    --model_args model=gpt-4o \
    --tasks jfinqa \
    --num_fewshot 0 \
    --include_path lm_eval_tasks/

Data Format

Each question follows the FinQA schema with additional metadata:

{
  "id": "nr_001",
  "subtask": "numerical_reasoning",
  "pre_text": ["以下はA社の連結損益計算書の抜粋である。"],
  "post_text": ["当期は前期比で増収増益となった。"],
  "table": {
    "headers": ["", "2024年3月期", "2023年3月期"],
    "rows": [
      ["売上高", "1,500,000", "1,200,000"],
      ["営業利益", "200,000", "150,000"]
    ]
  },
  "qa": {
    "question": "2024年3月期の売上高成長率は何%か?",
    "program": ["subtract(1500000, 1200000)", "divide(#0, 1200000)", "multiply(#1, 100)"],
    "answer": "25.0%",
    "gold_evidence": [0]
  },
  "edinet_code": "E00001",
  "filing_year": "2024",
  "accounting_standard": "J-GAAP"
}
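The `program` field uses the FinQA operation DSL, where `#n` refers to the result of step n. A minimal interpreter for it can be sketched as follows (illustrative only; `run_program` is not part of the jfinqa API, and real programs may use more operations than shown):

```python
import re

OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def run_program(program: list[str]) -> float:
    """Execute steps like 'subtract(1500000, 1200000)'; '#n' references step n."""
    results: list[float] = []
    for step in program:
        op, args = re.fullmatch(r"(\w+)\((.*)\)", step).groups()
        vals = [
            results[int(a[1:])] if a.startswith("#") else float(a)
            for a in (arg.strip() for arg in args.split(","))
        ]
        results.append(OPS[op](*vals))
    return results[-1]

program = ["subtract(1500000, 1200000)", "divide(#0, 1200000)", "multiply(#1, 100)"]
print(run_program(program))  # 25.0
```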

Japanese Number Handling

jfinqa correctly normalizes Japanese financial number formats:

| Input | Extracted value | Notes |
| --- | --- | --- |
| △1,000 | -1,000 | Triangle negative marker |
| 12,345 | 12,345 | Fullwidth digits + comma removal |
| 24,956百万円 | 24,956 | Compound financial units treated as labels |
| 50億 | 5,000,000,000 | Bare kanji multiplier applied |
| 42.5% | 42.5 | Percentage |
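A rough sketch of how such normalization can work (illustrative; `normalize_number` is an assumed name, not the library's actual function):

```python
import re
import unicodedata

def normalize_number(text: str) -> float:
    s = unicodedata.normalize("NFKC", text)   # fullwidth digits/commas -> ASCII
    s = s.replace("△", "-").replace(",", "")  # negative marker, thousands sep
    m = re.match(r"(-?\d+\.?\d*)(億|百万円)?", s)
    value = float(m.group(1))
    if m.group(2) == "億":
        value *= 100_000_000  # bare 億 is a 10^8 multiplier
    # 百万円 is treated as a unit label on already-scaled figures: no multiplier
    return value

print(normalize_number("△1,000"))       # -1000.0
print(normalize_number("50億"))         # 5000000000.0
print(normalize_number("24,956百万円"))  # 24956.0
```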

Development

git clone https://github.com/ajtgjmdjp/jfinqa
cd jfinqa
uv sync --dev --extra dev
uv run pytest -v
uv run ruff check .
uv run mypy src/

Data Attribution

Source financial data is obtained from EDINET (Electronic Disclosure for Investors' NETwork), operated by the Financial Services Agency of Japan (金融庁). EDINET data is provided under the Public Data License 1.0.

The data format is compatible with FinQA (Chen et al., 2021).

Related Projects

  • FinQA — English financial QA benchmark (Chen et al., 2021)
  • TAT-QA — Tabular and textual QA
  • edinet-mcp — EDINET XBRL parser (companion project)
  • EDINET-Bench — Sakana AI's financial classification benchmark

Citation

If you use jfinqa in your research, please cite it as follows:

@dataset{jfinqa2025,
  title={jfinqa: Japanese Financial Numerical Reasoning QA Benchmark},
  author={ajtgjmdjp},
  year={2025},
  url={https://github.com/ajtgjmdjp/jfinqa},
  license={Apache-2.0}
}

License

Apache-2.0. See NOTICE for third-party attributions.
