networkslab/ttbench

TTBench: LLM Benchmark for Test-Time-Compute

This is the repository for TTBench, the Test-Time Compute Benchmark.

This benchmark provides chain-of-thought (CoT) responses queried from multiple LLMs on a variety of mathematical and reasoning datasets. The few-shot query process and answer extraction are standardised for every dataset, which saves researchers both time and money.

Installation

Please install this benchmark from source.

pip install .

The benchmark requires the api_responses.zip file (download from Google Drive), which contains a database. The following example assumes this file is in your code directory.

Example

from ttbench import load, DatasetType, LLMType

# Load the SVAMP dataset together with two LLMs backed by the response database
dataset, [llm1, llm2] = load(DatasetType.SVAMP, [LLMType.LLaMA3B32, LLMType.Qwen72B25], api_path="api_responses.zip")

for question_id, dataentry in dataset:
    print("Question: ", dataentry.question)
    print("True answer: ", dataentry.answer)
    # Query the first LLM for this question with N=20 (presumably the number of CoT responses)
    llm1_response = llm1(question_id, N=20)
    print("Cost: $", llm1_response.cost)
    print("1st CoT answer: ", llm1_response.cots[0].answer)

Refer to the examples folder for more examples of benchmark evaluation.
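As an illustration of what a test-time-compute method built on these responses might look like, here is a minimal sketch of majority voting (self-consistency) over several CoT answers. It is not part of ttbench: the helper name `majority_vote` is hypothetical, and the code assumes that the answers are plain strings like the values printed from `cots[0].answer` above.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical answers extracted from five CoT responses to one question.
sampled_answers = ["8", "8", "7", "8", "9"]
print("Majority answer:", majority_vote(sampled_answers))
```

With the benchmark itself, the list of answers would presumably be collected as `[cot.answer for cot in llm1_response.cots]`, assuming `cots` can be iterated as well as indexed.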

Cost modelling

We also provide a procedure to model the dollar cost of each query. This ensures a fair comparison between test-time-compute methods.

from ttbench import load, DatasetType, LLMType

dataset, [llm] = load(DatasetType.CommonsenseQA, [LLMType.Mixtral8x7B], api_path="api_responses.zip")

question_id = 42
response = llm(question_id, N=2)

print(f"Request processing cost: ${response.request.cost:0.9f}")
print(f"First CoT response cost: ${response.cots[0].metadata.cost:0.9f}")
print(f"Total LLM query cost: ${response.cost:0.9f}")
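The three printed values suggest that a query's cost has a request part and a per-CoT part. The sketch below shows one way to check such a breakdown. It relies on an assumption this README does not state: that the total equals the request cost plus the sum of the per-CoT costs. The dataclasses are hypothetical stand-ins that only mirror the attribute names used above, they are not ttbench classes.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical stand-ins mirroring the attribute names shown above.
@dataclass
class Metadata:
    cost: float


@dataclass
class CoT:
    answer: str
    metadata: Metadata


@dataclass
class Request:
    cost: float


@dataclass
class Response:
    request: Request
    cots: List[CoT]


def summed_cost(response):
    """Request cost plus the cost of every CoT (assumed decomposition)."""
    return response.request.cost + sum(c.metadata.cost for c in response.cots)


# Made-up dollar values, for illustration only.
example = Response(
    request=Request(cost=0.0001),
    cots=[CoT("A", Metadata(0.0002)), CoT("B", Metadata(0.0004))],
)
print(f"Summed cost: ${summed_cost(example):0.9f}")
```

Comparing such a sum against `response.cost` for a real query would show whether the assumed decomposition actually holds in the benchmark.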
