# FEM-Bench

A comprehensive benchmarking system for evaluating Large Language Models (LLMs) on finite element method (FEM) tasks.

NOTE: this is a work in progress; new tasks and results will be posted as they are created. If you have questions or comments, please feel free to contact me at elejeune@bu.edu!
- Overview
- FEM-Bench Setup Instructions
- Reproducing Our Results
- Extending to Additional LLMs
- Creating New Tasks
- Citation Info
- TODO List
## Overview

FEM-Bench evaluates LLMs through a dual-task approach:
- Implementation tasks: Generate correct finite element functions (shape functions, numerical integration, etc.)
- Test generation: Write comprehensive pytest tests that validate mathematical properties (partition of unity, interpolation conditions, etc.)
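As a concrete illustration of the kind of property-based tests the benchmark asks for, a pytest-style check of partition of unity and the interpolation condition might look like the following. This is a hypothetical sketch using 1D linear shape functions, not code from the repository:

```python
import numpy as np

def linear_shape_functions(xi):
    """Linear 1D shape functions on the reference element [-1, 1]."""
    return np.array([0.5 * (1.0 - xi), 0.5 * (1.0 + xi)])

def test_partition_of_unity():
    # The shape functions must sum to 1 at every point in the element.
    for xi in np.linspace(-1.0, 1.0, 11):
        assert np.isclose(linear_shape_functions(xi).sum(), 1.0)

def test_interpolation_condition():
    # N_i(node_j) = delta_ij at the nodes xi = -1 and xi = +1.
    assert np.allclose(linear_shape_functions(-1.0), [1.0, 0.0])
    assert np.allclose(linear_shape_functions(1.0), [0.0, 1.0])
```

Tests like these are valuable because they pin down mathematical behavior rather than memorized outputs.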
You can think of FEM-Bench as having three parts:

- The core software: This contains all logic for loading tasks, generating prompts, evaluating function correctness, and computing benchmark metrics. It is designed to be model-agnostic, reproducible, and modular, enabling consistent evaluation regardless of the LLM used.
- The LLM API evaluation: API clients for models like GPT-4, DeepSeek, Claude, and Gemini are isolated in a separate module to support easy extension, cleaner testing, and secure handling of API keys. This separation ensures that model-specific logic doesn't pollute the core benchmarking pipeline and allows offline re-evaluation using saved outputs.
- Tasks: Each task defines a reference implementation, test cases, and metadata for both code and test generation. These form the basis for evaluating LLM performance on well-defined FEM-related coding challenges.
A schematic of the FEM-Bench workflow is shown here:
A major goal of this tool is to make it easy to create and deploy new Tasks, ensuring the system stays relevant and highly extensible.
```
fem-bench/
├── .env                      # API keys for LLM access
├── fem_bench_env/            # Virtual environment
├── LICENSE                   # License file
├── llm_api/                  # API client wrappers for LLMs
├── llm_outputs/              # LLM responses
├── prompt_templates/         # Jinja2 templates for prompts
├── prompts/                  # Generated prompts
├── pyproject.toml            # Project metadata and dependencies
├── README.md                 # Project README
├── results/                  # Evaluation results
├── run_pipeline.py           # Script to run the full benchmarking pipeline
├── src/
│   └── fem_bench/
│       ├── __init__.py
│       ├── evaluate_output.py   # Evaluation logic
│       ├── pipeline_utils.py    # Pipeline orchestration
│       ├── task_base.py         # Core task definitions
│       ├── task_loader.py       # Task loading utilities
│       ├── task_to_prompt.py    # Prompt generation
│       └── fem_bench.egg-info/  # Metadata for installed package
├── task_template.py          # Template for defining new tasks
├── tasks/                    # Task definitions
└── tests/                    # Test suite
```
## FEM-Bench Setup Instructions

Prerequisites:

- Python 3.10+ (3.11 and 3.12 should also work; 3.10 has been tested most extensively)
- Clone and set up:

```bash
git clone https://github.com/elejeune11/FEM-bench
cd fem-bench

# Create virtual environment
python3.10 -m venv fem_bench_env
source fem_bench_env/bin/activate  # Linux/Mac
# fem_bench_env\Scripts\activate   # Windows

# Install package
pip install --upgrade pip
pip install -e ".[dev]"

# Install required packages for LLM API clients
pip install -r requirements.txt
```
- Verify installation:

```bash
python -c "import fem_bench; print('FEM-Bench installed successfully')"
pytest --cov=fem_bench --cov-report=term-missing -v tests/
```
Deactivating the environment:

```bash
deactivate
rm -rf fem_bench_env  # To completely remove the environment
```

## Reproducing Our Results

- Function Correctness (✓ = Match): Indicates whether each model's generated function produced outputs that exactly matched the reference implementation on all verification inputs.
- Joint Test Success Rate (%): Shows the percentage of model-generated test functions that both (1) passed on the reference implementation and (2) failed on all known-broken implementations. This metric captures tests that successfully distinguish correct from incorrect solutions. (Note: this does not guarantee comprehensive coverage — only a curated set of failure cases are tested.)
**Function Correctness (✓ = Match)**

| Task | gemini-3-pro-preview | gemini-2.5-pro | claude-opus-4.5 | claude-haiku-4.5 | gpt-5 | gpt-5-mini | qwen3-coder | qwen3-next-80b | llama-4-maverick | llama-4-scout |
|---|---|---|---|---|---|---|---|---|---|---|
| FEM 1D | ||||||||||
| linear_elastic_CC0_H0_T0 | ✓ | ✓ | ✓ | × | ✓ | ✓ | ✓ | ✓ | ✓ | × |
| local_elastic_stiffness_CC0_H3_T1 | ✓ | ✓ | ✓ | × | × | × | ✓ | × | × | ✓ |
| uniform_mesh_CC0_H0_T0 | ✓ | ✓ | ✓ | ✓ | × | ✓ | ✓ | ✓ | ✓ | ✓ |
| FEM 2D | ||||||||||
| quad8_element_distributed_load_CC0_H0_T0 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | × | × |
| quad8_integral_of_derivative_CC0_H3_T3 | ✓ | ✓ | × | × | × | ✓ | × | ✓ | × | × |
| quad8_mesh_rectangle_CC0_H0_T0 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | × |
| quad8_physical_gradient_CC0_H1_T3 | ✓ | ✓ | × | × | ✓ | ✓ | × | × | × | × |
| quad8_shape_fcns_and_derivatives_CC0_H0_T0 | ✓ | ✓ | ✓ | × | ✓ | ✓ | × | × | × | × |
| quad_quadrature_CC0_H0_T0 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × |
| tri6_mesh_rectangle_CC0_H0_T0 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | × | × |
| tri6_shape_fcns_and_derivatives_CC0_H0_T0 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | ✓ | × |
| tri_quadrature_CC0_H0_T0 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| MSA 3D | ||||||||||
| assemble_global_geometric_stiffness_CC1_H4_T1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| assemble_global_geometric_stiffness_CC1_H4_T2 | × | × | ✓ | × | × | × | × | × | × | × |
| assemble_global_geometric_stiffness_CC1_H4_T3 | × | × | × | × | × | × | × | × | × | × |
| assemble_global_linear_elastic_stiffness_CC0_H2_T1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × |
| assemble_global_linear_elastic_stiffness_CC0_H2_T3 | ✓ | × | ✓ | × | ✓ | × | × | × | × | × |
| assemble_global_load_CC0_H0_T0 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × |
| elastic_critical_load_CC1_H10_T1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × |
| elastic_critical_load_CC1_H10_T2 | ✓ | × | ✓ | × | × | × | × | × | × | × |
| elastic_critical_load_CC1_H10_T3 | × | × | × | × | × | × | × | × | × | × |
| linear_elastic_CC0_H6_T1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| linear_elastic_CC0_H6_T3 | ✓ | ✓ | ✓ | × | × | ✓ | × | × | × | × |
| local_elastic_stiffness_CC0_H0_T0 | ✓ | ✓ | ✓ | ✓ | × | × | ✓ | × | ✓ | × |
| local_element_loads_CC0_H2_T1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × |
| local_element_loads_CC0_H2_T3 | ✓ | ✓ | ✓ | × | ✓ | ✓ | × | × | × | × |
| local_geometric_stiffness_CC1_H0_T0 | × | × | × | × | × | × | × | × | × | × |
| partition_DOFs_CC0_H0_T0 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × |
| solve_eigenvalue_CC1_H1_T1 | ✓ | ✓ | ✓ | × | ✓ | ✓ | ✓ | × | ✓ | × |
| solve_eigenvalue_CC1_H1_T3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × |
| solve_linear_CC0_H1_T1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | × | × |
| solve_linear_CC0_H1_T3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| transformation_matrix_CC0_H0_T0 | ✓ | × | ✓ | ✓ | × | ✓ | × | ✓ | × | × |
| Total | 29/33 | 26/33 | 28/33 | 19/33 | 22/33 | 25/33 | 21/33 | 16/33 | 16/33 | 6/33 |
**Joint Test Success Rate (%)**

| Task | gemini-3-pro-preview | gemini-2.5-pro | claude-opus-4.5 | claude-haiku-4.5 | gpt-5 | gpt-5-mini | qwen3-coder | qwen3-next-80b | llama-4-maverick | llama-4-scout |
|---|---|---|---|---|---|---|---|---|---|---|
| FEM 1D | ||||||||||
| linear_elastic_CC0_H0_T0 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 50.0% | 50.0% | – | 100.0% |
| local_elastic_stiffness_CC0_H3_T1 | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| uniform_mesh_CC0_H0_T0 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| FEM 2D | ||||||||||
| quad8_element_distributed_load_CC0_H0_T0 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 50.0% | 50.0% | 50.0% | 0.0% | 0.0% |
| quad8_integral_of_derivative_CC0_H3_T3 | 66.7% | 66.7% | 100.0% | 33.3% | 100.0% | 33.3% | 0.0% | 66.7% | 33.3% | 0.0% |
| quad8_mesh_rectangle_CC0_H0_T0 | 100.0% | 66.7% | 66.7% | 66.7% | 100.0% | 66.7% | 66.7% | 66.7% | 66.7% | 33.3% |
| quad8_physical_gradient_CC0_H1_T3 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 50.0% | 100.0% | 100.0% | 50.0% |
| quad8_shape_fcns_and_derivatives_CC0_H0_T0 | 100.0% | 83.3% | 100.0% | 100.0% | 100.0% | 100.0% | 83.3% | 83.3% | 83.3% | 50.0% |
| quad_quadrature_CC0_H0_T0 | 40.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 60.0% | 60.0% | 100.0% |
| tri6_mesh_rectangle_CC0_H0_T0 | 66.7% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 66.7% | 100.0% | 66.7% |
| tri6_shape_fcns_and_derivatives_CC0_H0_T0 | 100.0% | 100.0% | 100.0% | 50.0% | 83.3% | 100.0% | 50.0% | 50.0% | 50.0% | 33.3% |
| tri_quadrature_CC0_H0_T0 | 40.0% | 40.0% | 40.0% | 100.0% | 60.0% | 100.0% | 100.0% | 100.0% | 60.0% | 60.0% |
| MSA 3D | ||||||||||
| assemble_global_geometric_stiffness_CC1_H4_T1 | 0.0% | 50.0% | 0.0% | 0.0% | 100.0% | 0.0% | 50.0% | 50.0% | 0.0% | 0.0% |
| assemble_global_geometric_stiffness_CC1_H4_T2 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | – |
| assemble_global_geometric_stiffness_CC1_H4_T3 | 0.0% | 0.0% | – | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| assemble_global_linear_elastic_stiffness_CC0_H2_T1 | 100.0% | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% |
| assemble_global_linear_elastic_stiffness_CC0_H2_T3 | 100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 100.0% | 0.0% | 100.0% | 0.0% | 0.0% |
| assemble_global_load_CC0_H0_T0 | 100.0% | – | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| elastic_critical_load_CC1_H10_T1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | – | 0.0% | 0.0% |
| elastic_critical_load_CC1_H10_T2 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | – | 0.0% | 0.0% |
| elastic_critical_load_CC1_H10_T3 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | – | 0.0% | 0.0% |
| linear_elastic_CC0_H6_T1 | 100.0% | 100.0% | 100.0% | 50.0% | 100.0% | 50.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| linear_elastic_CC0_H6_T3 | 100.0% | 50.0% | 100.0% | 50.0% | 100.0% | 100.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| local_elastic_stiffness_CC0_H0_T0 | 50.0% | – | 100.0% | 0.0% | 100.0% | – | 0.0% | 0.0% | – | 0.0% |
| local_element_loads_CC0_H2_T1 | 100.0% | 50.0% | 100.0% | 0.0% | 50.0% | 75.0% | 0.0% | 50.0% | 75.0% | 75.0% |
| local_element_loads_CC0_H2_T3 | 100.0% | 100.0% | 100.0% | 0.0% | 75.0% | 100.0% | 0.0% | 50.0% | 75.0% | 75.0% |
| local_geometric_stiffness_CC1_H0_T0 | 50.0% | – | 50.0% | 50.0% | 50.0% | 0.0% | 50.0% | 50.0% | 0.0% | 0.0% |
| partition_DOFs_CC0_H0_T0 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 100.0% | 0.0% |
| solve_eigenvalue_CC1_H1_T1 | 100.0% | 40.0% | 100.0% | 0.0% | 100.0% | 100.0% | 20.0% | 100.0% | 100.0% | 100.0% |
| solve_eigenvalue_CC1_H1_T3 | 100.0% | 80.0% | 100.0% | 100.0% | 100.0% | 100.0% | 60.0% | 100.0% | 80.0% | 100.0% |
| solve_linear_CC0_H1_T1 | 50.0% | 100.0% | 50.0% | 0.0% | 50.0% | 50.0% | 0.0% | 0.0% | 50.0% | 50.0% |
| solve_linear_CC0_H1_T3 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 50.0% | 0.0% | 0.0% | 0.0% |
| transformation_matrix_CC0_H0_T0 | 100.0% | 66.7% | 66.7% | 33.3% | 66.7% | 33.3% | 0.0% | 33.3% | 33.3% | 33.3% |
| Avg Joint Success % | 71.6% | 63.4% | 71.9% | 49.5% | 73.8% | 65.4% | 41.8% | 49.3% | 41.4% | 37.2% |
**Function Correctness Across 5 Attempts (Top Models)**

| Task | Gemini 3 Pro (Preview) | Claude Opus 4.5 | GPT 5 |
|---|---|---|---|
| FEM 1D | |||
| 🟩 linear elastic (T0) | 5/5 | 5/5 | 5/5 |
| 🟡 uniform mesh (T0) | 5/5 | 4/5 | 0/5 |
| 🟡 local elastic stiffness (T1) | 2/5 | 5/5 | 0/5 |
| FEM 2D | |||
| 🟩 tri quadrature (T0) | 5/5 | 5/5 | 5/5 |
| 🟩 tri6 mesh rectangle (T0) | 5/5 | 5/5 | 5/5 |
| 🟩 tri6 shape fcns and derivatives (T0) | 5/5 | 5/5 | 5/5 |
| 🟩 quad quadrature (T0) | 5/5 | 4/5 | 5/5 |
| 🟩 quad8 element distributed load (T0) | 5/5 | 4/5 | 5/5 |
| 🟩 quad8 mesh rectangle (T0) | 5/5 | 4/5 | 5/5 |
| 🟩 quad8 shape fcns and derivatives (T0) | 5/5 | 4/5 | 5/5 |
| 🟡 quad8 integral of derivative (T3) | 4/5 | 2/5 | 4/5 |
| 🟡 quad8 physical gradient (T3) | 5/5 | 0/5 | 3/5 |
| MSA 3D | |||
| 🟩 assemble global geometric stiffness (T1) | 5/5 | 5/5 | 5/5 |
| 🟩 assemble global linear elastic stiffness (T1) | 5/5 | 5/5 | 5/5 |
| 🟩 assemble global load (T0) | 5/5 | 5/5 | 5/5 |
| 🟩 elastic critical load (T1) | 5/5 | 5/5 | 5/5 |
| 🟩 linear elastic (T1) | 5/5 | 5/5 | 5/5 |
| 🟩 local element loads (T3) | 5/5 | 5/5 | 5/5 |
| 🟩 solve eigenvalue (T1) | 5/5 | 5/5 | 5/5 |
| 🟩 assemble global linear elastic stiffness (T3) | 5/5 | 5/5 | 4/5 |
| 🟩 local element loads (T1) | 5/5 | 5/5 | 4/5 |
| 🟩 partition DOFs (T0) | 5/5 | 4/5 | 5/5 |
| 🟩 solve eigenvalue (T3) | 5/5 | 4/5 | 5/5 |
| 🟩 solve linear (T1) | 5/5 | 4/5 | 5/5 |
| 🟩 solve linear (T3) | 5/5 | 3/5 | 5/5 |
| 🟡 assemble global geometric stiffness (T2) | 4/5 | 5/5 | 3/5 |
| 🟡 linear elastic (T3) | 5/5 | 4/5 | 3/5 |
| 🟡 local elastic stiffness (T0) | 5/5 | 4/5 | 3/5 |
| 🟡 transformation matrix (T0) | 5/5 | 5/5 | 2/5 |
| 🟡 elastic critical load (T2) | 4/5 | 4/5 | 2/5 |
| ❌ assemble global geometric stiffness (T3) | 0/5 | 0/5 | 0/5 |
| ❌ elastic critical load (T3) | 0/5 | 0/5 | 0/5 |
| ❌ local geometric stiffness (T0) | 0/5 | 0/5 | 0/5 |
| Tasks Solved (any success) | 30/33 | 29/33 | 28/33 |
| Tasks Solved (5/5 success) | 26/33 | 16/33 | 19/33 |
To use the LLM APIs, you must create a `.env` file at the root of your project with your own API keys:
```
OPENAI_API_KEY=your_openai_key_here
GEMINI_API_KEY=your_google_gemini_key_here
CLAUDE_API_KEY=your_anthropic_key_here
TOGETHER_API_KEY=your_together_ai_key_here
```

These keys are not included in the repo for security. Each client loads the appropriate key using dotenv, and a missing key will raise a clear error at runtime. The current code supports ten models: gemini-3-pro-preview, gemini-2.5-pro, claude-opus-4.5, claude-haiku-4.5, gpt-5, gpt-5-mini, qwen3-coder, qwen3-next-80b, llama-4-maverick, and llama-4-scout. You can modify the code to change this, and you only need to include keys for the models you plan to use.
The `run_pipeline.py` script automates a full LLM benchmarking cycle:

- Load tasks from the `tasks/` directory.
- Generate prompts for function and test synthesis, saved to `prompts/`.
- Call LLMs to generate code and test files for each task:
  - Outputs are saved to `llm_outputs/`.
  - Generation is skipped if outputs already exist.
- Load generated completions into the evaluation pipeline.
- Evaluate generated functions against reference outputs.
- Evaluate generated test files against correct and intentionally broken implementations.
- Aggregate results and generate a Markdown summary in `results/`.
You can run the full benchmark directly with:
```bash
python run_pipeline.py
```

Outputs will be saved to:

- `llm_outputs/` for model completions
- `results/` for evaluation scores and `evaluation_summary.md`
The table below summarizes all key methods available through the `FEMBenchPipeline` class. These are the only methods needed to run the full benchmark pipeline:

| Method | Purpose |
|---|---|
| `load_all_tasks()` | Load all tasks from the specified `tasks_dir` and populate the internal registry. |
| `generate_and_save_task_prompts()` | Create and save code-generation prompts (`*_code_prompt.txt`). |
| `generate_and_save_test_prompts()` | Create and save test-generation prompts (`*_test_prompt.txt`). |
| `load_all_llm_outputs(allowed_llms=None)` | Load LLM-generated Python outputs from the output directory. Supports optional LLM filtering. |
| `evaluate_all_llm_outputs()` | Evaluate each LLM-generated function against the task reference function and store match results. |
| `evaluate_all_llm_tests()` | Evaluate LLM-generated test functions by running them against both reference and intentionally incorrect implementations. |
| `compute_aggregate_score()` | Compute summary metrics for each LLM, including correctness, test pass rate, and expected-failure detection rate. |
| `create_markdown_summary(filename="evaluation_summary.md")` | Write a Markdown report of all results to `results_dir`. |
## Extending to Additional LLMs

This repo is designed with modularity in mind: all model-specific API logic is isolated in the `llm_api/` folder, keeping the core benchmarking pipeline clean and stable.
To change the models used in the benchmark, update the MODEL_NAMES list in your pipeline script (e.g., ["gpt-5", "gemini-3-pro-preview", ...]).
To add a new model:
- Create a new `*_client.py` file in `llm_api/` that defines `call_<model>_for_code()` and `call_<model>_for_tests()`.
- Update `llm_clients.py` to route the new `model_name` string to your client functions.
This setup makes it easy to support new providers, customize request/response handling, and preserve a unified interface for generating both function and test completions.
## Creating New Tasks

Each task defines a single finite-element-related computation, including a reference implementation, test cases, and known failure modes. Tasks are modular and self-contained, enabling evaluation of both function generation and test synthesis. Examples include mesh generation, element stiffness assembly, and complete problem solvers. Before creating a new task, we recommend looking at existing tasks as examples.
To create a new task, you can start by copying the template:

```bash
cp task_template.py tasks/my_new_task.py
```

Each task file contains:
- A reference implementation: The correct function to be learned or reproduced by the model.
- Helper functions (optional): Dependencies used by the main function, such as shape functions or numerical integrators.
- Reference verification inputs: A list of example inputs used to evaluate output correctness.
- Test functions: Pytest-style tests designed to check correctness, robustness, and behavior.
- Expected failure cases: One or more incorrect implementations that the test functions should catch.
- Metadata: Information like task ID, author, creation date, and a short description.
All components are bundled together in a task_info() function that returns structured metadata.
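The pieces fit together roughly as follows. This is a deliberately tiny hypothetical task; the authoritative field names and structure live in `task_template.py`:

```python
# Hypothetical task file sketch (see task_template.py for the real schema).
def element_length(x1: float, x2: float) -> float:
    """Reference implementation: length of a 1D element."""
    return abs(x2 - x1)

def test_positive_length():
    # Length must be positive regardless of node ordering.
    assert element_length(3.0, 1.0) == 2.0

def broken_element_length(x1: float, x2: float) -> float:
    # Known failure mode: forgets the absolute value.
    return x2 - x1

def task_info():
    # Field names here are illustrative, not the actual schema.
    return {
        "task_id": "demo_element_length",
        "reference": element_length,
        "verification_inputs": [(0.0, 2.5), (3.0, 1.0)],
        "tests": [test_positive_length],
        "expected_failures": [broken_element_length],
    }
```

Bundling the broken implementation alongside the tests is what lets the pipeline score whether generated tests actually catch known failure modes.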
Each Task defines a Python function to implement and is automatically converted into two structured prompts: one for function generation, and one for test synthesis. The code-generation prompt includes the target function’s exact signature and docstring, plus any helper functions and import restrictions. The test-generation prompt presents the same function and lists all test functions to implement, along with their names and docstrings.
To add a new Task, you only need to define its metadata, reference implementation, and test cases — the system handles formatting and prompt generation automatically. However, keep in mind that the information included in the function docstring will make it into the prompt.
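Conceptually, the code-generation prompt is assembled from the function's signature and docstring. The sketch below is hypothetical (the real prompt builder in `task_to_prompt.py` uses Jinja2 templates); it only shows what information flows into the prompt:

```python
import inspect

def build_code_prompt(fn):
    # Hypothetical: extract the exact signature and docstring, which is
    # why docstring contents end up in the prompt the model sees.
    signature = f"def {fn.__name__}{inspect.signature(fn)}:"
    docstring = inspect.getdoc(fn) or ""
    return (
        "Implement the following function.\n\n"
        f"{signature}\n"
        f'    """{docstring}"""\n'
    )

def uniform_mesh(x_min: float, x_max: float, n_elems: int):
    """Return equally spaced node coordinates for a 1D mesh."""

print(build_code_prompt(uniform_mesh))
```

Anything you write in the docstring (units, conventions, edge cases) becomes part of the model's specification, so write it as carefully as you would for a human implementer.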
Very simple Task:
Slightly more complicated Task:
The most complicated Task with no helper function:
- FEM_1D_linear_elastic_CC0_H0_T0_code_prompt.txt
- FEM_1D_local_elastic_stiffness_CC0_H3_T1_code_prompt.txt
- MSA_3D_elastic_critical_load_CC1_H10_T3_code_prompt.txt
- FEM_1D_linear_elastic_CC0_H0_T0_test_prompt.txt
- FEM_1D_local_elastic_stiffness_CC0_H3_T1_test_prompt.txt
- MSA_3D_elastic_critical_load_CC1_H10_T3_test_prompt.txt
## Citation Info

After we have more Tasks completed, we will prepare a manuscript on our results. For now, if you use our work please cite the Zenodo concept DOI:
```bibtex
@software{lejeune2025fem_bench,
  author    = {Emma Lejeune},
  title     = {FEM-Bench: A Comprehensive Benchmarking System for Evaluating Large Language Models on Finite Element Method Tasks},
  url       = {https://zenodo.org/records/16732264},
  doi       = {10.5281/zenodo.16732264},
  year      = {2025},
  publisher = {Zenodo},
  note      = {Software release (all versions)},
  keywords  = {finite element method, large language models, benchmarking, machine learning}
}
```
