networkslab/feval_ttc

FEval-TTC: Fair Evaluation Protocol for Test-Time Compute

This repository contains FEval-TTC, the Fair Evaluation protocol for Test-Time Compute.

The evaluation framework provides chain-of-thought (CoT) responses queried from multiple LLMs on a variety of mathematical and reasoning datasets. The few-shot query process and answer extraction are standardised for every dataset, which saves researchers both time and money.

Installation

Install this package from source:

pip install .

The package requires the api_responses.zip file (download from Google Drive), which contains a database. The following example assumes this file is in your code directory.

Example

from feval_ttc import load, DatasetType, LLMType

# Load the SVAMP dataset and two LLMs from the downloaded database file.
dataset, [llm1, llm2] = load(DatasetType.SVAMP, [LLMType.LLaMA3B32, LLMType.Qwen72B25], api_path="api_responses.zip")

for question_id, dataentry in dataset:
    print("Question: ", dataentry.question)
    print("True answer: ", dataentry.answer)
    # Query the first LLM for this question with N=20.
    llm1_response = llm1(question_id, N=20)
    # Inspect the first returned CoT: its answer and its cost.
    print("1st CoT answer: ", llm1_response.cots[0].answer)
    print("Token cost: ", llm1_response.cots[0].tokens)
    print("USD Cost: ", llm1_response.cots[0].dollar_cost)

Refer to the examples folder for more examples of benchmark evaluation.
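
As an illustration of how the per-CoT fields shown above (answer, tokens, dollar_cost) might be combined into a simple test-time compute method, here is a minimal sketch of majority voting over several CoTs with cost accounting. It does not import feval_ttc: the `CoT` dataclass and the `majority_vote` helper are hypothetical stand-ins that only mirror the attribute names used in the example, and the sketch assumes answers can be compared for equality.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CoT:
    """Hypothetical stand-in mirroring the fields used in the example above."""
    answer: Optional[str]
    tokens: int
    dollar_cost: float


def majority_vote(cots: List[CoT]) -> Tuple[Optional[str], int, float]:
    """Return the most frequent answer plus the total token and USD cost."""
    # Ignore samples without an answer when voting.
    answers = [c.answer for c in cots if c.answer is not None]
    voted = Counter(answers).most_common(1)[0][0] if answers else None
    # Every sample is paid for, whether or not it produced an answer.
    total_tokens = sum(c.tokens for c in cots)
    total_cost = sum(c.dollar_cost for c in cots)
    return voted, total_tokens, total_cost


# Toy samples; with the real package these would presumably be llm1_response.cots.
samples = [CoT("8", 120, 0.0002), CoT("8", 95, 0.00015), CoT("7", 130, 0.00022)]
voted, tokens, cost = majority_vote(samples)
print("Voted answer:", voted)
print("Total tokens:", tokens)
print("Total USD cost:", cost)
```

Comparing the voted answer with dataentry.answer for each question, and summing the costs, would give an accuracy-versus-cost estimate for this method under the same assumptions.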

Citing

If you use this protocol in your project, please consider citing:

@inproceedings{rumiantsev2025fevalttc,
    title={{FE}val-{TTC}: {Fair Evaluation Protocol for Test-Time Compute}},
    author={Pavel Rumiantsev and Soumyasundar Pal and Yingxue Zhang and Mark Coates},
    booktitle={NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling},
    year={2025},
    url={https://openreview.net/forum?id=Fj9Ge7TdrY}
}
