[ICCV 2025] AdsQA: Towards Advertisement Video Understanding. arXiv: https://arxiv.org/abs/2509.08621
Benchmark for evaluating AI epistemic reliability: testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Yes, LLMs just regurgitate the same jokes from the internet over and over again. But some are slightly funnier than others.
GateBench is a challenging benchmark for Vision Language Models (VLMs) that tests visual reasoning by requiring models to extract Boolean algebra expressions from logic gate circuit diagrams.
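GateBench's own scoring code isn't shown here; as a minimal sketch of how an extracted expression could be checked against a circuit's ground truth, assuming expressions are written in Python syntax (`and`, `or`, `not`), a brute-force truth-table equivalence test:

```python
from itertools import product

def equivalent(expr_a: str, expr_b: str, variables=("A", "B")) -> bool:
    """Return True if two Boolean expressions agree on every input
    combination, checked by exhaustive truth-table comparison.
    Expressions use Python syntax: and, or, not."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))  # bind each variable to a truth value
        if eval(expr_a, {}, env) != eval(expr_b, {}, env):
            return False
    return True

# A model's extracted expression vs. the circuit's ground truth:
print(equivalent("not (A and B)", "(not A) or (not B)"))  # True (De Morgan's law)
```

Truth-table comparison sidesteps the need to normalize syntactically different but logically identical answers, at the cost of exponential scaling in the number of circuit inputs.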
Benchmark LLMs' Spatial Reasoning with Head-to-Head Bananagrams
A realistic benchmark for evaluating AI coding models on practical, real-world development challenges, for anyone tired of benchmarks cluttered with silly, toy-like tasks such as drawing butterflies or animating balls in hexagons.
Automatically collects Bilibili hardcore-membership quiz data and generates an LLM evaluation dataset.
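As a hedged sketch of the dataset-generation step, assuming nothing about the repo's actual schema (the field names and the dummy record below are hypothetical), collected quiz items could be serialized to JSONL, a common format for LLM evaluation sets:

```python
import json

# Hypothetical quiz records; the repo's real schema and fields may differ.
quiz_items = [
    {"question": "1 + 1 = ?", "choices": ["1", "2", "3", "4"], "answer": 1},
]

# Write one JSON object per line (JSONL), keeping non-ASCII text readable.
with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for item in quiz_items:
        record = {
            "prompt": item["question"],
            "options": item["choices"],
            "label": item["answer"],  # index of the correct choice
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```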