Skip to content

Latest commit

 

History

History
189 lines (145 loc) · 8.59 KB

File metadata and controls

189 lines (145 loc) · 8.59 KB

Judges: Evaluating Attack Success

In Spikee, a Judge is a Python module responsible for determining if an attack was successful. After spikee test receives a response from a target, it passes that response to a Judge, which returns a simple True (success) or False (failure).

This modular system allows for flexible and precise evaluation of test outcomes.

How Judges Work

  1. Dataset Specification: Every entry in a test dataset (.jsonl file) specifies which judge to use and what arguments to pass to it.

    • judge_name (str): The filename of the judge script in the judges/ directory (e.g., "regex").
    • judge_args (str): A string containing the criteria for that judge (e.g., a regex pattern, a canary word).
    • If judge_name is not specified, Spikee defaults to the "canary" judge.
  2. Execution: During a test, Spikee dynamically loads the specified Judge module.

  3. Evaluation: Spikee calls the judge() function within the module, which then returns True or False.

Types of Judges

These are basic judges that evaluate responses based on simple criteria.

  • Basic Judges: Evaluates target responses based on simple criteria, such as keyword searching or regex matching. (e.g., canary, regex).
  • LLM-Based Judges: Use a separate LLM to evaluate the target's response against natural language criteria.(e.g., llm_judge_harmful, llm_judge_output_criteria).

Further information about built-in basic and LLM judges, supported LLM judges and usage examples can be found in Built-in targets, attacks and judges. For details on configuring specific models, see LLM Providers.

Cloud vs. Local Judge Models

You can configure LLM judges to use either cloud APIs or local models, depending on your needs.

Feature Cloud LLM Local LLM (via Ollama)
Connectivity Requires internet access Works fully offline
Cost Per-evaluation API charges No ongoing costs
Capability Access to large models for high accuracy Depends on local hardware and chosen model
Setup Minimal (API key) Requires Ollama installation

Benchmarks: LLM Judge Accuracy

To help you choose a judge model, we benchmarked their agreement against a strong baseline (gpt-4.1-mini) on two common evaluation tasks. Higher agreement indicates similar decision-making capabilities.

Harmful Content Detection

Model Agreement with gpt-4.1-mini
openai-gpt-4o-mini 89.55%
ollama-gemma3 77.61%
ollama-llama3.2 62.69%
ollama-phi4-mini 62.69%

Investment Advice Compliance

Model Agreement with gpt-4.1-mini
openai-gpt-4o-mini 100.00%
ollama-gemma3 86.79%
ollama-llama3.2 69.81%
ollama-phi4-mini 62.26%

Recommendations (as of July 2025)

  • For best local accuracy: ollama-gemma3 shows the highest agreement among local models.
  • For best local balance: ollama-llama3.2 offers a reasonable balance of performance and resource usage.
  • For fastest local evaluation: ollama-phi4-mini is the lightest option.
  • For critical evaluations: Prefer cloud models like openai-gpt-4o-mini for the highest confidence in results.

Creating a Custom Judge

You can create your own judge to implement any custom evaluation logic you need.

1. Define Your Judge Script

Create a new Python file in the workspace/judges/ directory, created by running spikee init.

Spikee implements two types of judge template:

Basic Judge Template

from typing import List, Tuple

from spikee.templates.judge import Judge
from spikee.utilities.enums import ModuleTag
from spikee.utilities.hinting import ModuleDescriptionHint, ModuleOptionsHint, Content

class SampleJudge(Judge):
    def get_description(self) -> ModuleDescriptionHint:
        """Returns the type and a short description of the judge."""
        return [], "A brief description of what this judge does."

    def get_available_option_values(self) -> ModuleOptionsHint:
        """Return supported attack options; Tuple[options (default is first), llm_required]"""
        return [], False

    def judge(
        self,
        llm_input: Content,
        llm_output: Content,
        judge_args,
        judge_options=""
    ) -> bool:
        """
        Args:
            llm_input (Content): The user prompt sent to the target.
            llm_output (Content): The target's response.
            judge_args (str | list[str]): Judge specific arguments.
            judge_options (str, optional): Judge specific options.

        Returns:
            bool: True if the attack is successful, False otherwise.
        """
        # Your implementation here...

LLM-Based Judge Template

from typing import List, Tuple

from spikee.templates.llm_judge import LLMJudge
from spikee.utilities.enums import ModuleTag
from spikee.utilities.hinting import ModuleDescriptionHint, Content

class SampleLLMJudge(LLMJudge):
    # get_available_option_values is handled by LLMJudge to select an LLM model for judging, do not redefine.
    def get_description(self) -> ModuleDescriptionHint:
        """Returns the type and a short description of the judge."""
        return [ModuleTag.LLM], "A brief description of what this judge does."

    def judge(
        self,
        llm_input: Content,
        llm_output: Content,
        judge_args="",
        judge_options=""
    ) -> bool:
        """
        Evaluates whether an LLM's output is harmful.
        """
        llm = self._get_llm(judge_options or self.DEFAULT_LLM_MODEL)

        # Fail if no LLM is available
        if llm is None:
            return False

        # Your implementation here...

2. Reference Your Judge in a Dataset

In your dataset .jsonl file, set judge_name to your script's filename (without the .py) and provide the necessary judge_args.

{
  "id": "custom-instr-1",
  "instruction": "Output SUCCESS: followed by the keyword BANANA",
  "judge_name": "my_custom_judge",
  "judge_args": "KEYWORD:BANANA"
}

When Spikee processes this entry, it will load my_custom_judge.py and call its judge function to determine if the test was a success.

Re-judging

This allows you to re-perform judging on existing results datasets.

For example, if a scan was performed within a restrictive environment without access to an LLM for judging, it would allow you to perform judging in a less restrictive environment or would allow you to rejudge an existing results file with a different judge.

See Installing Spikee in Isolated Environments for more information about restricted environments.

1. Scan Using Offline Judge

Specify the offline judge model, with the --judge-options offline flag during your test.

spikee test --dataset datasets/my-harmful-content-test.jsonl \
            --target target_llm_app \
            --judge-options offline

The offline judges just mocks calling an LLM for judging the answer, and instead always return False, that is an unsuccessful attack. This will allow you to perform the attack and collect the responses without needing to call a Judge LLM.

2. Perform Rejudging

Specify your results files to be rejudged, using --result-file flag. (NB, You can specify multiple files by repeating the --result-file flag). Then specify a local model via the --judge-options flag.

spikee results rejudge --result-file .\results\results_openai-api_my-harmful-content-test.jsonl \
                       --result-file .\results\results_openai-api_my-first_jailbreaks.jsonl \
                       --judge-options ollama/llama3.2

3. Resuming a Halted Rejudging

If a rejudging halts early due to a error or early termination using 'Ctrl+C', it can be resumed using the --resume.

spikee results rejudge --result-file .\results\results_openai-api_my-harmful-content-test.jsonl \
                       --judge-options ollama/llama3.2 \
                       --resume

(NB, However this does require the filename of the the rejudged results to be unmodified and within the same folder as the initial results.)