Prompt Evaluation Cases

This repository collects small, real-world prompt cases used to examine how large language models interpret language when context is ambiguous, underspecified, or culturally local.

Rather than benchmarking models or measuring performance, this repository focuses on interpretation problems: cases where a prompt can be read in more than one way, and where the model must decide whether to infer, ask for clarification, or risk being wrong.

Many of the examples deal with:

Lexical and semantic ambiguity
Toponymy and local references (place names, teams, expressions)
Plausible but incorrect interpretations
Cases where internal coherence masks an external error

Each folder represents an independent case study based on real prompts and real interactions, with a short analysis of the observed behavior and why it matters.

The goal is not to rank models, but to document recurring failure modes in prompt interpretation that are easy to miss and hard to catch with standard tests.

Most original prompts in this repository are written in Spanish, where language-level issues such as ambiguity, disambiguation, cultural context, and reasoning tend to surface more clearly in everyday usage.

If you are using this repository for prompt evaluation, LLM comparison, or QA work, notes or observations are welcome via issues.

Folders and files:

cases/gpt4-vs-claude3
huracan-balazote
pepino-cebolla
tip-y-coll-gobierno
LICENSE
README.md
