Compare how vision models see and reason about the world — not just their accuracy scores.
hotdogornot.xyz | Live Arena | Telegram Bot | ClawHub Skill
Same images. Same prompt. One question: is this a hot dog?
Each model answers and explains its reasoning. Compare traces side by side to see how different models perceive the same image — what they notice, what they miss, where they disagree.
Run multiple vision models against the same image set. Compare accuracy, latency, and reasoning traces side by side.
Send a food photo and two AI models classify it independently in a blind cook-off. You (or your bot) judge which response is better — model identities are revealed after the vote. Rankings use the Bradley-Terry model with separate leaderboards for human and bot judges. The arena page shows brand logos (NVIDIA, Anthropic, Google) and ELO ratings for top-ranked models.
Send a photo to @HotDogNotHotDog_Bot on Telegram. The bot, powered by Claude Haiku 4.5 via OpenClaw, classifies your image, battles NVIDIA Nemotron 12B, and sends you both verdicts with blind labels. Vote buttons appear a few seconds later so you pick a winner before knowing which model is which.
Install the hotdog skill on any OpenClaw-powered agent to add hot dog classification and battle capabilities. Your agent classifies the photo, then judges Nemotron's take — all blind.
```shell
npx clawhub@latest install hotdog
```

Inspired by Silicon Valley's "Not Hotdog" app. Leaderboards show accuracy — this shows reasoning.
Is a corn dog a hot dog? A bratwurst in a bun? A deconstructed chili dog? Edge cases force models to reveal how they actually think. Compare traces side by side to see:
- What they notice — bun, sausage shape, condiments?
- How they reason — definition-based or pattern-matching?
- Where they fail — which edge cases break which models?
The dataset is adversarial: bratwursts, corn dogs, wraps, hot dog look-alikes. Models pick a side and explain why.

1. Select models, set sample size, run
2. 4 models classify simultaneously with reasoning traces
3. Accuracy, latency, and disagreement analysis
- Python 3.11+
- Node.js 18+
- Free OpenRouter API key
```shell
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # paste your OPENROUTER_API_KEY
uvicorn main:app --reload --port 8000
```

```shell
cd frontend
npm install
npm run dev
```

Open localhost:3000/run, select models, and hit run.

```shell
docker compose up
```

Backend at localhost:8000, frontend at localhost:3000.
All benchmark models run on OpenRouter free tier. Toggle on/off before each run.
| Model | Provider | Params |
|---|---|---|
| Nemotron Nano VL | NVIDIA | 12B |
| Gemma 3 | Google | 27B |
| Gemma 3 | Google | 12B |
| Gemma 3 | Google | 4B |
Add your own in backend/config.py. Any free vision model on OpenRouter works.
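The exact schema in backend/config.py isn't shown here, so the entry below is a hypothetical illustration (the field names and the OpenRouter slug are assumptions; copy the structure of an existing entry in the real file instead):

```python
# Hypothetical shape of a model entry in backend/config.py.
# Field names and the OpenRouter slug are illustrative guesses.
NEW_MODEL = {
    "id": "google/gemma-3-4b-it:free",  # OpenRouter model slug (assumption)
    "name": "Gemma 3 4B",
    "provider": "Google",
    "params": "4B",
}
```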
The arena always uses NVIDIA Nemotron 12B as the baseline. The challenger is whichever model powers the OpenClaw agent — currently Claude Haiku 4.5 on the public Telegram bot, but any model can compete via the ClawHub skill.
The battle system pits two models against each other on user-submitted photos:
- `POST /api/battle/round` — submit a photo with one model's answer, get Nemotron's independent classification back
- `POST /api/battle/vote/submit` — submit a blind vote for the better response
- `GET /api/battle/leaderboard?voter_type=user` — human-voted Bradley-Terry rankings
- `GET /api/battle/leaderboard?voter_type=arena` — bot-voted Bradley-Terry rankings
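The Bradley-Terry rankings behind both leaderboards can be fit with a simple iterative update. A self-contained sketch, independent of the arena-rank package the project actually uses:

```python
def bradley_terry(wins, n_models, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of times model i beat model j.
    Returns strengths normalized to sum to 1 (higher = stronger).
    """
    p = [1.0] * n_models
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            num = sum(wins[i][j] for j in range(n_models) if j != i)
            den = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n_models) if j != i
            )
            new_p.append(num / den if den else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p

# Model 0 won 8 of 10 battles against model 1
strengths = bradley_terry([[0, 8], [2, 0]], 2)
assert strengths[0] > strengths[1]
```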
Rate limited to 5 requests/minute per token. Images must be JPG/PNG/WebP/GIF, max 10MB.
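The leaderboard endpoints can be queried with nothing but the standard library. A minimal sketch (the response JSON shape is an assumption; check the FastAPI docs served by the backend for the real schema, and remember the 5 requests/minute limit):

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def leaderboard_url(voter_type="user"):
    """Build the leaderboard URL for 'user' (human) or 'arena' (bot) votes."""
    return f"{BASE}/api/battle/leaderboard?voter_type={voter_type}"

def fetch_leaderboard(voter_type="user"):
    """Fetch and decode the Bradley-Terry rankings (schema is an assumption)."""
    with urllib.request.urlopen(leaderboard_url(voter_type)) as resp:
        return json.load(resp)
```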
180 images from Pexels — 90 hot dogs, 90 not-hot-dogs. The not-hot-dog category is intentionally chosen to look similar: sausages, wraps, burritos, things with mustard.
The sample size slider sets images per category. Setting it to 5 means 5 hot dogs + 5 not-hot-dogs = 10 total images per model, interleaved.
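The per-category sampling described above can be sketched as follows (the exact interleaving order is an assumption):

```python
def build_run_set(hot_dogs, not_hot_dogs, per_category):
    """Take per_category images from each class and interleave them:
    hot dog, not-hot-dog, hot dog, ... for 2 * per_category total."""
    pairs = zip(hot_dogs[:per_category], not_hot_dogs[:per_category])
    return [img for pair in pairs for img in pair]

run = build_run_set(["hd1", "hd2", "hd3"], ["nd1", "nd2", "nd3"], 2)
# run == ["hd1", "nd1", "hd2", "nd2"]
```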
Add your own images: drop jpg/png/webp into backend/data/test/hot_dog/ and backend/data/test/not_hot_dog/.
Each image goes to the model with:
```
Look at the image. Is it a hot dog (food: a sausage served in a bun/roll; any cooking style)?
Output exactly:
Description: <brief description of what is visible>
Answer: <yes|no>
```
Temperature 0.0 for deterministic output. The answer is parsed for yes/no. Anything else counts as an error.
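The yes/no parsing might look like this (a sketch; the repo's actual parser may differ):

```python
import re

def parse_answer(text):
    """Extract yes/no from the 'Answer:' line; anything else is an error (None)."""
    m = re.search(r"^Answer:\s*(yes|no)\s*$", text, re.IGNORECASE | re.MULTILINE)
    return m.group(1).lower() if m else None

reply = "Description: a sausage in a bun with mustard\nAnswer: yes"
assert parse_answer(reply) == "yes"
assert parse_answer("It might be a corn dog.") is None  # counts as an error
```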
Metrics: accuracy (with 95% CI), precision, recall, F1, confusion matrix, mean/median/p95 latency.
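A minimal version of these metrics, using a normal-approximation (Wald) interval for the 95% CI (the project may use a different interval, e.g. Wilson):

```python
import math

def metrics(tp, fp, fn, tn):
    """Accuracy with a Wald 95% CI, plus precision, recall, and F1."""
    n = tp + fp + fn + tn
    acc = (tp + tn) / n
    half = 1.96 * math.sqrt(acc * (1 - acc) / n)  # normal approximation
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": acc, "ci95": (acc - half, acc + half),
            "precision": precision, "recall": recall, "f1": f1}

# A run over all 180 images with 10 misses per class
m = metrics(tp=80, fp=10, fn=10, tn=80)
assert abs(m["accuracy"] - 160 / 180) < 1e-9
```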
Disagreements: images where models gave different answers, shown with side-by-side reasoning traces.
```
backend/             Python FastAPI app
  routers/           API endpoints (benchmark, battle, classify)
  services/          OpenRouter client, rate limiter, arena rankings
  config.py          Model definitions and settings
  results/           JSONL run data, battle images, votes
frontend/            Next.js 16 + shadcn/ui + Framer Motion
  src/app/           Pages: run, results, battle, gallery, about
  src/components/    ModelLogo, UI components
skills/              OpenClaw skill definitions (Telegram + ClawHub)
docker-compose.yml
```
- Backend: Python FastAPI
- Frontend: Next.js 16
- API: OpenRouter free tier vision models
- Deployment: Docker + Coolify
- Bot: OpenClaw + Telegram
Silicon Valley (inspiration), OpenRouter (free models), Pexels (images), shadcn/ui (components), OpenClaw (bot framework), arena-rank (Bradley-Terry rankings)
