FinCDM (Financial Cognitive Diagnosis Model) is a comprehensive evaluation framework for financial large language models. It moves beyond traditional score-level evaluation by providing knowledge-skill level diagnosis, identifying which financial skills and knowledge a model possesses or lacks.
This project introduces a new paradigm for financial LLM evaluation by enabling interpretable, skill-aware diagnosis that supports more trustworthy and targeted model development.
From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models
- 📖 Paper: Hugging Face Paper Page
- 📝 arXiv: arXiv:2508.13491
We introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs. Unlike existing benchmarks that rely on single aggregate scores, FinCDM evaluates models at the knowledge-skill level, revealing hidden knowledge gaps and identifying under-tested areas such as tax and regulatory reasoning often overlooked by traditional benchmarks.
We provide two comprehensive datasets for financial LLM evaluation:
- 🤗 Hugging Face: https://huggingface.co/datasets/NextGenWhu/FinCDM-FinEval-KQA
- Description: A knowledge–skill annotated extension of the FinEval benchmark, designed to support fine-grained evaluation of financial reasoning capabilities.
- Features:
  - Fine-grained financial knowledge and skill labels
  - Coverage of multiple financial sub-domains
  - Expert-validated annotations
- 🤗 Hugging Face: https://huggingface.co/datasets/NextGenWhu/FinCDM-Fin-KQA
- Description: A unified, cognitively informed financial knowledge–skill evaluation dataset derived from professional certification examinations, integrating materials from both the Certified Public Accountant (CPA) and Chartered Financial Analyst (CFA) curricula.
- Composition:
  - CFA-KQA
    - 123 evaluation questions
      - File: `CFA-KQA.json`
    - Focuses on advanced investment analysis, portfolio management, and professional ethics
  - CPA-KQA
    - 1,050 training questions
      - File: `CPA-KQA-training.json`
    - 210 evaluation questions
      - File: `CPA-KQA-test.json`
    - Covers real-world accounting, auditing, and financial reporting skills
- Features:
  - Comprehensive coverage of professional-level financial and accounting competencies
  - Fine-grained knowledge and skill annotations
  - Expert annotations with high inter-annotator agreement
  - Designed to support both training and rigorous evaluation settings
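
The knowledge and skill annotations lend themselves to the item-by-skill indicator matrix (often called a Q-matrix) that cognitive diagnosis models commonly take as input. A minimal sketch of building one follows. The record layout, the field names `id` and `skills`, and the sample labels are assumptions for illustration and may not match the released JSON files.

```python
from typing import List, Tuple

# Hypothetical records: field names and labels are illustrative assumptions,
# not the actual schema of CFA-KQA.json or CPA-KQA-test.json.
items = [
    {"id": "q1", "skills": ["Deferred tax liabilities"]},
    {"id": "q2", "skills": ["Lease classification", "Regulatory ratios"]},
    {"id": "q3", "skills": ["Regulatory ratios"]},
]


def build_q_matrix(records: List[dict]) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (item_ids, skill_names, Q) where Q[i][k] is 1 if item i tests skill k."""
    # Fix a stable column order by sorting the distinct skill labels.
    skills = sorted({s for r in records for s in r["skills"]})
    column = {s: k for k, s in enumerate(skills)}
    item_ids = [r["id"] for r in records]
    q = [[0] * len(skills) for _ in records]
    for i, r in enumerate(records):
        for s in r["skills"]:
            q[i][column[s]] = 1
    return item_ids, skills, q


item_ids, skill_names, q_matrix = build_q_matrix(items)
for item_id, row in zip(item_ids, q_matrix):
    print(item_id, row)
```

Each row marks which skills an item exercises, so an item tagged with several skills contributes evidence to each of them.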
- Cognitive diagnosis framework for financial LLMs
- Knowledge-skill level evaluation beyond simple scores
- Two comprehensive evaluation datasets (FinEval-KQA and CPA-KQA)
- Evaluation scripts and tools (coming soon)
- Model proficiency visualization
- Skill acquisition pattern analysis
- Behavioral cluster identification
- Knowledge-Skill Level Diagnosis: Unlike traditional benchmarks that provide single scores, FinCDM reveals specific strengths and weaknesses across different financial skills
- Comprehensive Coverage: Tests previously overlooked areas like:
  - Tax and regulatory reasoning
  - Deferred tax liabilities
  - Lease classification
  - Regulatory ratios
- Model Clustering Analysis: Identifies latent associations between financial concepts and reveals distinct clusters of models with similar skill acquisition patterns
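
To make the contrast with a single aggregate score concrete, the sketch below computes plain per-skill accuracy from toy graded responses. It is a naive stand-in for illustration, not FinCDM's diagnosis model, and the response data and skill tags are invented for the example.

```python
from collections import defaultdict

# Toy graded responses for one model: (skills tested by the item, answered correctly?).
# The data is invented for illustration only.
responses = [
    (["Tax and regulatory reasoning"], False),
    (["Tax and regulatory reasoning", "Deferred tax liabilities"], False),
    (["Lease classification"], True),
    (["Regulatory ratios"], True),
    (["Lease classification"], True),
    (["Regulatory ratios"], True),
]


def aggregate_accuracy(rows):
    """Fraction of all items answered correctly (the single-score view)."""
    return sum(correct for _, correct in rows) / len(rows)


def per_skill_accuracy(rows):
    """Fraction of correct answers among the items tagged with each skill."""
    hits, totals = defaultdict(int), defaultdict(int)
    for skills, correct in rows:
        for skill in skills:
            totals[skill] += 1
            hits[skill] += int(correct)
    return {skill: hits[skill] / totals[skill] for skill in totals}


overall = aggregate_accuracy(responses)
by_skill = per_skill_accuracy(responses)
print(f"aggregate: {overall:.2f}")
for skill, acc in sorted(by_skill.items()):
    print(f"{skill}: {acc:.2f}")
```

In this toy data the aggregate score looks moderate, while the per-skill view shows that every tax-related item was missed.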
- Python 3.8+
- Git
- PyTorch >= 1.12.0
- Transformers >= 4.25.0
- NumPy, Pandas, Scikit-learn
Clone the repository and install the dependencies:

    # Clone the repository
    git clone https://github.com/WHUNextGen/FinCDM.git
    cd FinCDM

    # Install dependencies (once available)
    pip install -r requirements.txt

Load the datasets:

    from datasets import load_dataset

    # Load FinEval-KQA dataset
    fineval_data = load_dataset("NextGenWhu/FinCDM-FinEval-KQA")

    # Load CPA-KQA dataset
    cpa_data = load_dataset("NextGenWhu/FinCDM-CPA-KQA")

Evaluate a model and get its knowledge-skill diagnosis:

    from fincdm import FinCDMEvaluator

    # Initialize evaluator
    evaluator = FinCDMEvaluator(data_root=".")

    # Evaluate your model
    results = evaluator.evaluate(
        q_path="",
        a_path="",
    )
    print(results.metrics)

    # Get knowledge-skill diagnosis
    diagnosis = evaluator.diagnose(results, export_csv="SK_df.csv")

We ran extensive experiments on 30+ LLMs, including:
- Proprietary models (GPT-4, GPT-3.5, Claude)
- Open-source models (LLaMA, Mistral, Qwen)
- Domain-specific models (FinGPT, FinMA, FinQwen)
Key findings:
- Reveals hidden knowledge gaps in state-of-the-art models
- Identifies behavioral clusters among different model families
- Uncovers specialization strategies in domain-specific models
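
One simple way to look for behavioral clusters is to compare models by their per-skill proficiency vectors. The sketch below groups models whose vectors have cosine similarity above a threshold. The model names, scores, and threshold are placeholders, and this greedy grouping is an assumed stand-in rather than the clustering method used in the paper.

```python
import math

# Placeholder per-skill proficiency vectors (same skill order for every model).
# Names and numbers are invented; they are not results from the paper.
profiles = {
    "model_a": [0.9, 0.8, 0.2],
    "model_b": [0.85, 0.75, 0.25],
    "model_c": [0.2, 0.3, 0.9],
    "model_d": [0.25, 0.2, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_clusters(vectors, threshold=0.95):
    """Assign each model to the first cluster whose seed model is similar enough."""
    clusters = []  # each cluster is a list of names; the first name is the seed
    for name, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster matched, so this model seeds a new one.
            clusters.append([name])
    return clusters


clusters = greedy_clusters(profiles)
print(clusters)
```

Models with similar strengths and weaknesses end up together regardless of their overall score, which is the kind of pattern a skill-level diagnosis can surface.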
We welcome contributions! Please feel free to:
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
If you use FinCDM in your research, please cite our paper:

    @article{fincdm2024,
      title={From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models},
      author={Kuang, Ziyan and others},
      journal={arXiv preprint arXiv:2508.13491},
      year={2025}
    }

This project is licensed under the MIT License. See the LICENSE file for details.
- WHU NextGen Team
- Contributors from Wuhan University
- GitHub Repository: https://github.com/WHUNextGen/FinCDM
- Paper: https://huggingface.co/papers/2412.06264
- FinEval-KQA Dataset: https://huggingface.co/datasets/NextGenWhu/FinCDM-FinEval-KQA
- CPA-KQA Dataset: https://huggingface.co/datasets/NextGenWhu/FinCDM-CPA-KQA
For questions and feedback, please:
- Open an issue on GitHub
- Contact the WHU NextGen Team
⭐ Star this repository if you find it helpful!
🔥 Check out our datasets on Hugging Face for financial LLM evaluation!
