Better Score schema #28

reisepass · 2025-11-07T14:42:13Z

reisepass
Nov 7, 2025
Maintainer

Our score schema is too verbose

reisepass · 2025-11-07T14:43:41Z

reisepass
Nov 7, 2025
Maintainer Author

Related work

https://github.com/confident-ai/deepeval/blob/main/deepeval/tracing/api.py line 30

MetricData - deepeval/tracing/api.py:30-42
This is the core schema for evaluation scores:
class MetricData(BaseModel):
name: str
threshold: float
success: bool
score: Optional[float] = None
reason: Optional[str] = None
strict_mode: Optional[bool] = Field(False, alias="strictMode")
evaluation_model: Optional[str] = Field(None, alias="evaluationModel")
error: Optional[str] = None
evaluation_cost: Union[float, None] = Field(None, alias="evaluationCost")
verbose_logs: Optional[str] = Field(None, alias="verboseLogs")

TestRun - deepeval/test_run/test_run.py:126-409
This is the top-level schema that gets saved to disk containing:

Test cases: test_cases (List[LLMApiTestCase])
Scores: metrics_scores (List[MetricScores])
Prompts: prompts (List[PromptData])
Summary: test_passed, test_failed, run_duration, evaluation_cost

LLMApiTestCase - deepeval/test_run/api.py:9-97
Combines test case data with evaluation results:

Inherits prompt/response fields from LLMTestCase
Adds metrics_data (List[MetricData]) for scores
Adds success, run_duration, evaluation_cost

The TestRun.save() method (deepeval/test_run/test_run.py:389-396) saves data to:

Location: .deepeval/.temp_test_run_data.json and .deepeval/.latest_test_run.json
Format: JSON using Pydantic's model_dump(by_alias=True, exclude_none=True)

1 reply

reisepass Nov 7, 2025
Maintainer Author

Also logs to console but again very practical not looking to be forwardable

To Output Evaluation Results to Files

When using the evaluate() function in deepeval, you can specify a file output directory using the DisplayConfig:

from deepeval import evaluate
from deepeval.evaluate.configs import DisplayConfig

Your test cases and metrics setup here...

Evaluate with file output

result = evaluate(
test_cases=test_cases,
metrics=metrics,
display_config=DisplayConfig(
file_output_dir="./output" # Specify your output directory
)
)

What Gets Written

The evaluation results will be written to a timestamped log file (e.g., test_run_20240107_143022.log) in the specified directory. The file
includes:

Metrics Summary - For each test case:
- Metric name
- Score
- Threshold
- Whether it passed (✅) or failed (❌)
- Evaluation model used
- Reason
- Any errors

Test Case Details:
- Input (prompt)
- Actual output (response)
- Expected output
- Context
- Retrieval context

Overall Metric Pass Rates - Aggregated statistics across all test cases

The output file is located at: deepeval/evaluate/utils.py:339-437

Alternative: Print to Console

If you just want to see results in the console without writing to a file, you can use:

result = evaluate(
test_cases=test_cases,
metrics=metrics,
display_config=DisplayConfig(
print_results=True # This is the default
)
)

The evaluation results are also accessible programmatically through the returned EvaluationResult object, which contains test_results you can
process however you need.

mkaramuk · 2025-11-07T14:51:27Z

mkaramuk
Nov 7, 2025
Maintainer

Here an example of the result object of promptfoo (when using promptfoo eval --output results.json

// ...
{
  "cost": 0,
  "gradingResult": {
    "pass": true,
    "score": 1,
    "reason": "No assertions",
    "tokensUsed": {
      "total": 0,
      "prompt": 0,
      "completion": 0,
      "cached": 0,
      "numRequests": 0
    }
  },
  "id": "29176d50-dc54-4f5a-88f0-c88dd8c044f3",
  "latencyMs": 7,
  "namedScores": {},
  "prompt": {
    "raw": "Write a tweet about bananas",
    "label": "Write a tweet about {{topic}}"
  },
  "promptId": "add16627d8dbb348b8b3ac175c8b96107d26a4b08b5be0262962f8ec5b18ec9e",
  "promptIdx": 0,
  "provider": {
    "id": "openrouter:google/gemini-2.5-flash-lite",
    "label": ""
  },
  "response": {
    "output": "Here are a few options for a tweet about bananas, choose the one that best fits your vibe!\n\n**Option 1 (Simple & Sweet):**\n\n> Just a friendly reminder that bananas are nature's perfect snack. 🍌 Delicious, convenient, and packed with goodness! #banana #healthysnack #fruit\n\n**Option 2 (Playful & Fun):**\n\n> Officially declaring today \"Banana Appreciation Day\"! 🤩 Who else is a huge fan of this amazing yellow fruit? Let's go bananas! 🤪 #banana #fruity #love\n\n**Option 3 (Focus on Benefits):**\n\n> Feeling that afternoon slump? Reach for a banana! ⚡️ Great for energy and a good source of potassium. Your body will thank you. 🙏 #banana #energyboost #potassium #healthy\n\n**Option 4 (Short & Punchy):**\n\n> Banana vibes. 🍌 Simple perfection. #banana\n\n**Option 5 (Engaging Question):**\n\n> What's your favorite way to eat a banana? Smoothie, plain, or baked? 🤔 I'm curious! 👇 #banana #foodie #snackideas\n\n**Remember to add a banana emoji (🍌) for extra visual appeal!**",
    "tokenUsage": {
      "cached": 259,
      "total": 259
    },
    "cached": true,
    "finishReason": "stop"
  },
  "score": 1,
  "success": true,
  "testCase": {
    "vars": {
      "topic": "bananas"
    },
    "assert": [],
    "options": {},
    "metadata": {}
  },
  "testIdx": 0,
  "vars": {
    "topic": "bananas"
  },
  "metadata": {
    "_promptfooFileMetadata": {}
  },
  "failureReason": 0
}

// ...

Each result uses an UUID v4 ID
Prompt IDs are generated from SHA256 calculation of the raw Prompt data (template version)
Prompts are also available as individual objects
Everything is coupled together (prompt, response, score)

0 replies

reisepass · 2025-11-07T14:52:55Z

reisepass
Nov 7, 2025
Maintainer Author

https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/evaluator.py

Location: lm_eval/evaluator.py:634-659

Each sample saved to disk has this structure:

{
"doc_id": int, # Document ID
"doc": dict, # Original input document
"target": str, # Expected answer/target
"arguments": dict, # Prompts (converted to {"gen_args_0": {...}, ...})
"resps": list, # Raw model responses
"filtered_resps": list, # Processed/filtered responses
"filter": str, # Filter name used
"metrics": list[str], # List of metric names
"doc_hash": str, # Hash of document
"prompt_hash": str, # Hash of prompt
"target_hash": str, # Hash of target
# Plus individual metric scores (e.g., "acc": 0.95, "f1": 0.88)
}

Saved to: samples_{task_name}_{timestamp}.jsonl files

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better Score schema #28

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Your test cases and metrics setup here...

Evaluate with file output

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Better Score schema #28

Uh oh!

reisepass Nov 7, 2025 Maintainer

Replies: 3 comments · 1 reply

Uh oh!

reisepass Nov 7, 2025 Maintainer Author

Uh oh!

Uh oh!

reisepass Nov 7, 2025 Maintainer Author

Your test cases and metrics setup here...

Evaluate with file output

Uh oh!

mkaramuk Nov 7, 2025 Maintainer

Uh oh!

reisepass Nov 7, 2025 Maintainer Author

reisepass
Nov 7, 2025
Maintainer

Replies: 3 comments 1 reply

reisepass
Nov 7, 2025
Maintainer Author

reisepass Nov 7, 2025
Maintainer Author

mkaramuk
Nov 7, 2025
Maintainer

reisepass
Nov 7, 2025
Maintainer Author