[ICCV 2025] AdsQA: Towards Advertisement Video Understanding. arXiv: https://arxiv.org/abs/2509.08621
Benchmark for evaluating AI epistemic reliability: testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Yes, LLMs just regurgitate the same jokes from the internet over and over again. But some are slightly funnier than others.
GateBench is a challenging benchmark for Vision Language Models (VLMs) that tests visual reasoning by requiring models to extract Boolean algebra expressions from logic gate circuit diagrams.
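GateBench's own scoring code isn't shown here; as a minimal sketch of how an extracted expression could be checked against a circuit's ground truth, assuming expressions are written in Python syntax (`and`, `or`, `not`), a brute-force truth-table equivalence test:

```python
from itertools import product

def equivalent(expr_a: str, expr_b: str, variables=("A", "B")) -> bool:
    """Return True if two Boolean expressions agree on every input
    combination, checked by exhaustive truth-table comparison.
    Expressions use Python syntax: and, or, not."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))  # bind each variable to a truth value
        if eval(expr_a, {}, env) != eval(expr_b, {}, env):
            return False
    return True

# A model's extracted expression vs. the circuit's ground truth:
print(equivalent("not (A and B)", "(not A) or (not B)"))  # True (De Morgan's law)
```

Truth-table comparison sidesteps the need to normalize syntactically different but logically identical answers, at the cost of exponential scaling in the number of circuit inputs.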
Benchmark LLMs' Spatial Reasoning with Head-to-Head Bananagrams
A realistic benchmark for evaluating AI coding models on practical, real-world development challenges, for anyone tired of benchmarks cluttered with silly, toy-like tasks such as drawing butterflies or animating balls in hexagons.
Automatically collects Bilibili hardcore-membership quiz data and generates an LLM evaluation dataset.
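As a hedged sketch of the dataset-generation step, assuming nothing about the repo's actual schema (the field names and the dummy record below are hypothetical), collected quiz items could be serialized to JSONL, a common format for LLM evaluation sets:

```python
import json

# Hypothetical quiz records; the repo's real schema and fields may differ.
quiz_items = [
    {"question": "1 + 1 = ?", "choices": ["1", "2", "3", "4"], "answer": 1},
]

# Write one JSON object per line (JSONL), keeping non-ASCII text readable.
with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for item in quiz_items:
        record = {
            "prompt": item["question"],
            "options": item["choices"],
            "label": item["answer"],  # index of the correct choice
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```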