This is a repository for TTBench, the Test-Time Compute Benchmark.
The benchmark provides chain-of-thought (CoT) responses queried from multiple LLMs on a variety of mathematical and reasoning datasets. The few-shot querying process and answer extraction are standardised across all datasets, saving researchers both time and money.
Please install this benchmark from source:

```bash
pip install .
```

The benchmark requires the api_responses.zip file (download it from Google Drive), which contains the response database.
For the following example, let us assume this file is in your code directory.
```python
from ttbench import load, DatasetType, LLMType

dataset, [llm1, llm2] = load(DatasetType.SVAMP, [LLMType.LLaMA3B32, LLMType.Qwen72B25], api_path="api_responses.zip")

for question_id, dataentry in dataset:
    print("Question: ", dataentry.question)
    print("True answer: ", dataentry.answer)
    llm1_response = llm1(question_id, N=20)
    print("Cost: $", llm1_response.cost)
    print("1st CoT answer: ", llm1_response.cots[0].answer)
```

Refer to the examples folder for more examples of benchmark evaluation.
We also provide a procedure to model the dollar cost of each query. This enables a fair comparison between test-time compute methods.
```python
from ttbench import load, DatasetType, LLMType

dataset, [llm] = load(DatasetType.CommonsenseQA, [LLMType.Mixtral8x7B], api_path="api_responses.zip")

question_id = 42
response = llm(question_id, N=2)
print(f"Request processing cost: ${response.request.cost:0.9f}")
print(f"First CoT response cost: ${response.cots[0].metadata.cost:0.9f}")
print(f"Total LLM query cost: ${response.cost:0.9f}")
```