reproducing benchmarks

Hi, I'm trying to reproduce your benchmarks so that I can compare them to some modifications I'm curious about. In order to avoid costly API calls, I tried sideloading your informalizations found here into postgres https://huggingface.co/datasets/FrenzyMath/mathlib_informal_v4.16.0

I used some LLM agents to write code to get it working, and it's working now for search. 

I'm continuing to try and reproduce your results, partiularly the last row of table 5. Would you be able to to provide more scripts and data that you used for your benchmarks? I found the theorems used in the benchmark here: https://aclanthology.org/2024.findings-emnlp.470/ But I lack things like your augmented queries and scripts for computing metrics. I can come up with these on my own but I figured I'd ask if you could share yours.

Thanks,
Isaiah

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reproducing benchmarks #17

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

reproducing benchmarks #17

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions