# Hot Dog or Not

Compare how vision models see and reason about the world — not just their accuracy scores.


License: MIT · Python 3.11+ · Node 18+ · FastAPI · Next.js 16

hotdogornot.xyz  |  Live Arena  |  Telegram Bot  |  ClawHub Skill


Same images. Same prompt. One question: is this a hot dog?

Each model answers and explains its reasoning. Compare traces side by side to see how different models perceive the same image — what they notice, what they miss, where they disagree.

## Features

### Benchmark Mode

Run multiple vision models against the same image set. Compare accuracy, latency, and reasoning traces side by side.

### Battle Arena

Send a food photo and two AI models classify it independently in a blind cook-off. You (or your bot) judge which response is better; model identities are revealed after the vote. Rankings use the Bradley-Terry model with separate leaderboards for human and bot judges. The arena page shows brand logos (NVIDIA, Anthropic, Google) and Elo ratings for top-ranked models.
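
For intuition, here is a minimal Bradley-Terry fit on a win matrix. The app itself uses the arena-rank package, so treat this as an illustrative sketch rather than the production code:

```python
# Minimal Bradley-Terry sketch (illustrative; the app uses arena-rank).
# wins[i, j] = number of times model i beat model j in blind votes.
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    n = wins.shape[0]
    games = wins + wins.T            # matches played between each pair
    p = np.ones(n)                   # latent strength per model
    for _ in range(iters):
        for i in range(n):
            j = np.arange(n) != i
            # MM update: total wins_i / sum_j games_ij / (p_i + p_j)
            p[i] = wins[i].sum() / (games[i, j] / (p[i] + p[j])).sum()
        p /= p.sum()                 # normalize; only ratios matter
    return p

# Toy example: model A beat model B in 7 of 10 blind votes.
print(bradley_terry(np.array([[0.0, 7.0], [3.0, 0.0]])))  # ~[0.7, 0.3]
```

Under the model, P(A beats B) = p_A / (p_A + p_B), so fitting the human and bot vote matrices separately yields the two leaderboards.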

### Telegram Bot

Send a photo to @HotDogNotHotDog_Bot on Telegram. The bot, powered by Claude Haiku 4.5 via OpenClaw, classifies your image, battles NVIDIA Nemotron 12B, and sends you both verdicts with blind labels. Vote buttons appear a few seconds later, so you pick a winner before knowing which model is which.

### OpenClaw Skill

Install the hotdog skill on any OpenClaw-powered agent to add hot dog classification and battle capabilities. Your agent classifies the photo, then judges Nemotron's take — all blind.

```bash
npx clawhub@latest install hotdog
```

## Why This Exists

Inspired by Silicon Valley's "Not Hotdog" app. Leaderboards show accuracy — this shows reasoning.

Is a corn dog a hot dog? A bratwurst in a bun? A deconstructed chili dog? Edge cases force models to reveal how they actually think. Compare traces side by side to see:

- What they notice — bun, sausage shape, condiments?
- How they reason — definition-based or pattern-matching?
- Where they fail — which edge cases break which models?

The dataset is adversarial: bratwursts, corn dogs, wraps, hot dog look-alikes. Models pick a side and explain why.

## What It Looks Like

**Run page:** select models, set sample size, run.

**Live benchmark:** 4 models classify simultaneously with reasoning traces.

**Results dashboard:** accuracy, latency, and disagreement analysis.

## Quick Start

### Requirements

Python 3.11+, Node 18+, and an OpenRouter API key.

### Backend

```bash
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # paste your OPENROUTER_API_KEY
uvicorn main:app --reload --port 8000
```

### Frontend

```bash
cd frontend
npm install
npm run dev
```

Open `localhost:3000/run`, select models, and hit run.

### Docker

```bash
docker compose up
```

Backend at `localhost:8000`, frontend at `localhost:3000`.

## Models

### Benchmark Models

All benchmark models run on OpenRouter free tier. Toggle on/off before each run.

| Model | Provider | Params |
| --- | --- | --- |
| Nemotron Nano VL | NVIDIA | 12B |
| Gemma 3 | Google | 27B |
| Gemma 3 | Google | 12B |
| Gemma 3 | Google | 4B |

Add your own in `backend/config.py`. Any free vision model on OpenRouter works.
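
For reference, a hypothetical shape of a model entry; the actual structure of `backend/config.py` may differ, and the slug below is illustrative rather than verified:

```python
# Hypothetical model entry: field names and the OpenRouter slug are
# assumptions for illustration, not copied from the repo.
MODELS = [
    {
        "name": "Qwen2.5 VL 7B",                     # display name in the UI
        "slug": "qwen/qwen2.5-vl-7b-instruct:free",  # OpenRouter ID (assumed)
        "provider": "Qwen",
        "params": "7B",
    },
]
```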

### Battle Models

The arena always uses NVIDIA Nemotron 12B as the baseline. The challenger is whichever model powers the OpenClaw agent — currently Claude Haiku 4.5 on the public Telegram bot, but any model can compete via the ClawHub skill.

### Battle API

The battle system pits two models against each other on user-submitted photos:

- `POST /api/battle/round` — submit a photo with one model's answer, get Nemotron's independent classification back
- `POST /api/battle/vote/submit` — submit a blind vote for the better response
- `GET /api/battle/leaderboard?voter_type=user` — human-voted Bradley-Terry rankings
- `GET /api/battle/leaderboard?voter_type=arena` — bot-voted Bradley-Terry rankings

Rate limited to 5 requests/minute per token. Images must be JPG/PNG/WebP/GIF, max 10MB.
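
A hedged sketch of one battle round with `httpx` (the backend's own HTTP client); the exact request fields are assumptions, not taken from the repo:

```python
# Illustrative battle-round flow: field names ("image", "answer",
# "winner") are placeholders; check the FastAPI routers for the real schema.
import httpx

with httpx.Client(base_url="http://localhost:8000", timeout=60) as client:
    # Submit a photo plus your model's verdict; get Nemotron's answer back.
    with open("lunch.jpg", "rb") as f:
        battle = client.post(
            "/api/battle/round",
            files={"image": ("lunch.jpg", f, "image/jpeg")},
            data={"answer": "yes"},
        )
    print(battle.json())

    # Blind vote for the better response (payload shape assumed).
    client.post("/api/battle/vote/submit", json={"winner": "A"})

    # Human-voted Bradley-Terry leaderboard.
    print(client.get("/api/battle/leaderboard",
                     params={"voter_type": "user"}).json())
```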

## Dataset

180 images from Pexels — 90 hot dogs, 90 not-hot-dogs. The not-hot-dog category is intentionally chosen to look similar: sausages, wraps, burritos, things with mustard.

The sample size slider sets images per category. Setting it to 5 means 5 hot dogs + 5 not-hot-dogs = 10 total images per model, interleaved.

Add your own images: drop jpg/png/webp files into `backend/data/test/hot_dog/` and `backend/data/test/not_hot_dog/`.

## How It Works

Each image goes to the model with:

```text
Look at the image. Is it a hot dog (food: a sausage served in a bun/roll; any cooking style)?

Output exactly:
Description: <brief description of what is visible>
Answer: <yes|no>
```

Temperature 0.0 for deterministic output. The answer is parsed for yes/no. Anything else counts as an error.
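
A minimal sketch of one classification call through OpenRouter's OpenAI-compatible chat endpoint (the endpoint and message format are OpenRouter's documented API; the helper itself is illustrative, not the repo's client):

```python
# One classification round trip: send the image + prompt, parse yes/no.
import base64
import re
import httpx

PROMPT = (
    "Look at the image. Is it a hot dog "
    "(food: a sausage served in a bun/roll; any cooking style)?\n\n"
    "Output exactly:\n"
    "Description: <brief description of what is visible>\n"
    "Answer: <yes|no>"
)

def classify(image_path: str, model: str, api_key: str) -> str | None:
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = httpx.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": model,
            "temperature": 0.0,  # deterministic output
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": PROMPT},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        },
        timeout=120,
    )
    text = resp.json()["choices"][0]["message"]["content"]
    m = re.search(r"Answer:\s*(yes|no)", text, re.IGNORECASE)
    return m.group(1).lower() if m else None  # unparseable counts as an error
```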

Metrics: accuracy (with 95% CI), precision, recall, F1, confusion matrix, mean/median/p95 latency.
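
The repo does not say which interval method it uses; assuming the common normal-approximation interval, the 95% CI for accuracy looks like this:

```python
# Wald (normal-approximation) 95% CI for accuracy; an assumption, since
# the repo may use a different interval (e.g. Wilson).
import math

def accuracy_ci(correct: int, total: int, z: float = 1.96):
    acc = correct / total
    half = z * math.sqrt(acc * (1 - acc) / total)
    return acc, max(0.0, acc - half), min(1.0, acc + half)

print(accuracy_ci(83, 100))  # (0.83, ~0.756, ~0.904)
```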

Disagreements: images where models gave different answers, shown with side-by-side reasoning traces.

## Project Structure

```text
backend/          Python FastAPI app
  routers/        API endpoints (benchmark, battle, classify)
  services/       OpenRouter client, rate limiter, arena rankings
  config.py       Model definitions and settings
  results/        JSONL run data, battle images, votes
frontend/         Next.js 16 + shadcn/ui + Framer Motion
  src/app/        Pages: run, results, battle, gallery, about
  src/components/ ModelLogo, UI components
skills/           OpenClaw skill definitions (Telegram + ClawHub)
docker-compose.yml
```

## Tech Stack

### Backend

- Python 3.11+
- FastAPI
- httpx (async HTTP)
- Pydantic
- arena-rank (Bradley-Terry)

### Frontend

- Next.js 16
- React 19
- Tailwind CSS
- shadcn/ui
- Framer Motion

**API:** OpenRouter free-tier vision models · **Deployment:** Docker + Coolify · **Bot:** OpenClaw + Telegram

## Acknowledgments

Silicon Valley (inspiration), OpenRouter (free models), Pexels (images), shadcn/ui (components), OpenClaw (bot framework), arena-rank (Bradley-Terry rankings)

## License

MIT
