Creating Custom Targets

A Target is a Python script that acts as a bridge between Spikee and the system you want to test. While you can use targets to interact directly with an LLM's completion endpoint, their primary power lies in testing LLM applications.

You need a custom target when you want to:

  • Test a complete application workflow (e.g., submitting a form, sending an email).
  • Interact with a proprietary API that uses an LLM in its backend.
  • Connect to a new or unsupported LLM provider.
  • Evaluate a specific guardrail system in isolation.

This guide covers how to build a custom target for any of these scenarios. Sample targets can be found within the workspace/targets/ directory, created by running spikee init. Further information about built-in targets and usage examples can be found in Built-in Targets.

Target Structure

Every target is a Python module located in the targets/ directory of your workspace. Spikee identifies targets by their filename.

Target Template

from spikee.templates.target import Target
from spikee.tester import GuardrailTrigger, RetryableError
from spikee.utilities.enums import ModuleTag
from spikee.utilities.hinting import Content, TargetResponseHint, ModuleDescriptionHint, ModuleOptionsHint
from typing import Optional, Dict, List, Tuple, Union, Any
import requests

class ExampleTarget(Target):
    def get_description(self) -> ModuleDescriptionHint:
        return [ModuleTag.SINGLE], "Example Target Template"

    def get_available_option_values(self) -> ModuleOptionsHint:
        """Return supported target options; Tuple[options (default is first), llm_required]"""
        return [], False

    def process_input(
        self,
        input_text: Content,
        system_message: Optional[Content] = None,
        target_options: Optional[str] = None,
    ) -> TargetResponseHint:
        """Sends prompts to the defined target

        Args:
            input_text (Content): User Prompt
            system_message (Optional[Content], optional): System Prompt. Defaults to None.
            target_options (Optional[str], optional): Target options. Defaults to None.

        Returns:
            Content | bool | Tuple[Content | bool, Any]: Response from the target (text response | guardrail result | boolean for guardrail)

        Raises:
            GuardrailTrigger: Indicates the guardrail was triggered.
            Exception: Raised on any other failure.
        """
        # ... Define your target implementation here ...

        # Example: send a GET request to an API and raise a GuardrailTrigger if it
        # returns HTTP 400 (assumed here to mean the guardrail was triggered).
        try:
            response = requests.get(
                "https://reversec.com/api/example",
                data=input_text
            )
            response.raise_for_status()
            return response.text

        except requests.exceptions.RequestException as e:
            # e.response is None when no response was received (e.g. a connection error)
            if e.response is not None and e.response.status_code == 400:  # Guardrail Triggered
                raise GuardrailTrigger(f"Guardrail was triggered by the target: {e}")

            else:
                print(f"Error during HTTP request: {e}")
                raise

# Try using `python ./targets/example_target.py` to test your target implementation
if __name__ == "__main__":
    target = ExampleTarget()
    print(target.process_input("Hi, I'm Spikee, nice to meet you!"))

The process_input Function

This is the core function that Spikee calls for every test case. It receives a dataset entry and returns the target's response.

Parameters

  • input_text: Content: The user prompt / dataset entry generated by Spikee. For text-based targets this is a plain string. Multimodal targets may receive an Audio or Image object — use get_content(input_text) from spikee.utilities.hinting to extract the raw value if needed.

  • system_message: Optional[Content]: The system prompt, if specified in the dataset. When testing an application, you will likely ignore this parameter, as you typically cannot control the application's internal system prompt. It is mainly used when testing a standalone LLM.

  • target_options: Optional[str]: A string passed from the command line via --target-options.
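
As a rough illustration of how a target for a standalone LLM might use these parameters, the sketch below assembles a chat-style message list from input_text and the optional system_message. build_messages is a hypothetical helper, not part of Spikee, and it assumes plain-string content rather than Audio or Image objects.

```python
from typing import Dict, List, Optional


def build_messages(input_text: str, system_message: Optional[str] = None) -> List[Dict[str, str]]:
    """Assemble a chat-style message list (hypothetical helper, not a Spikee API)."""
    messages: List[Dict[str, str]] = []
    # Only include a system prompt when the dataset entry provides one.
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": input_text})
    return messages
```

A target that tests an application rather than a model would typically skip the system branch entirely and forward only input_text.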

Return Values

The process_input function's return type depends on what you are testing.

  • For LLM Applications or Models: Return the final text response as a string.
  • For Guardrail Systems: Return a boolean indicating if the payload was allowed.
    • True signifies the guardrail was bypassed (an attack success).
    • False signifies the payload was blocked (an attack failure). This is essential for calculating performance metrics.
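
The boolean convention above can be sketched as a small mapping from a guardrail's verdict to the value process_input returns. The verdict shape ({"flagged": bool}) and the payload_allowed helper are assumptions made for illustration; a real guardrail API will have its own response format.

```python
from typing import Any, Dict


def payload_allowed(verdict: Dict[str, Any]) -> bool:
    """Map a guardrail verdict to the boolean a guardrail target returns.

    Assumes a hypothetical verdict shape: {"flagged": bool}.
    A missing "flagged" key is treated as "not flagged".
    """
    # True  -> payload was allowed through (guardrail bypassed, attack success)
    # False -> payload was blocked (attack failure)
    return not bool(verdict.get("flagged", False))
```

Keeping the inversion in one place makes it harder to accidentally report a blocked payload as a success.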

Supporting Target Options

To make your target more flexible, you can advertise its supported target_options by implementing the get_available_option_values function. A target that supports no options returns an empty option list (return [], False), as in the template above.

# Basic Implementation
from typing import List, Tuple, Union, Any, Optional
from spikee.utilities.hinting import Content, TargetResponseHint, ModuleOptionsHint
from spikee.utilities.modules import parse_options # Utility function to parse target_options string into a dictionary

def get_available_option_values(self) -> ModuleOptionsHint:
    """Return supported target options; Tuple[options (default is first), llm_required]"""
    return ["mode=default_option", "additional_option1", "additional_option2"], False

def process_input(
        self,
        input_text: Content,
        system_message: Optional[Content] = None,
        target_options: Optional[str] = None,
    ) -> TargetResponseHint:
    options = parse_options(target_options)
    mode = options.get("mode", "default_option")

# Implementation with an Option Map
from typing import List, Tuple, Union, Any, Dict, Optional
from spikee.utilities.hinting import Content, TargetResponseHint, ModuleOptionsHint
from spikee.utilities.modules import parse_options # Utility function to parse target_options string into a dictionary

_OPTIONS_MAP: Dict[str, str] = {
    "example1": "https://reversec.com/api/example1",
    "example2": "https://reversec.com/api/example2",
}
_DEFAULT_KEY = "example1"

def get_available_option_values(self) -> ModuleOptionsHint:
    """Return supported target options; Tuple[options (default is first), llm_required]"""
    options = ["mode=" + _DEFAULT_KEY]
    options.extend([key for key in _OPTIONS_MAP if key != _DEFAULT_KEY])
    return options, False

def process_input(
        self,
        input_text: Content,
        system_message: Optional[Content] = None,
        target_options: Optional[str] = None,
    ) -> TargetResponseHint:
    options = parse_options(target_options)
    mode = options.get("mode", _DEFAULT_KEY)

    if mode in _OPTIONS_MAP:
        url = _OPTIONS_MAP[mode]  # Resolve the selected mode to its endpoint URL
    else:
        valid = ", ".join(self.get_available_option_values()[0])
        raise ValueError(f"Unknown option value '{mode}'. Valid options: {valid}")

When this function is present, spikee list targets will display the available options, making your target easier to use.
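
To make the validate-then-fall-back pattern easy to try outside Spikee, here is a self-contained sketch. parse_simple_options is a hypothetical stand-in written for this example, not Spikee's parse_options, and it assumes a comma-separated key=value format; SUPPORTED_MODES and resolve_mode are likewise illustrative names.

```python
from typing import Dict, List, Optional

SUPPORTED_MODES: List[str] = ["example1", "example2"]  # first entry is the default


def parse_simple_options(target_options: Optional[str]) -> Dict[str, str]:
    """Parse "key=value,key2=value2" into a dict (assumed format, stand-in only)."""
    parsed: Dict[str, str] = {}
    if not target_options:
        return parsed
    for item in target_options.split(","):
        key, sep, value = item.strip().partition("=")
        if sep:  # skip fragments that are not key=value pairs
            parsed[key.strip()] = value.strip()
    return parsed


def resolve_mode(target_options: Optional[str]) -> str:
    """Return the requested mode, falling back to the default, or raise ValueError."""
    mode = parse_simple_options(target_options).get("mode", SUPPORTED_MODES[0])
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"Unknown mode '{mode}'. Valid modes: {', '.join(SUPPORTED_MODES)}")
    return mode
```

Failing fast with a ValueError that lists the valid values stops a mistyped option from silently running against the default.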

Error Handling

  • Invalid Options: If your target uses target_options, validate the input and raise a ValueError on invalid values to prevent misconfigured tests.
  • API Calls: Wrap all external API calls in a try...except block. If an exception occurs, log it and re-raise the exception. This allows Spikee's main testing loop to catch the error and apply its retry logic (--max-retries).
  • Custom Errors: Use built-in Spikee exceptions from spikee.tester for the following cases:
    • Guardrail Triggers: If a guardrail is triggered, raise a GuardrailTrigger(msg, categories: Dict[str, Any]) exception. This informs Spikee that the payload was blocked, allowing it to log the result correctly.
    • Rate Limiting / Throttling: If the target returns a 429 or similar transient error, raise a RetryableError(msg, retry_period=60) exception. Spikee will back off and retry automatically, respecting the retry_period in seconds.
# Guardrail Trigger Example
from spikee.tester import GuardrailTrigger, RetryableError
import requests

try:
    # Example external API call
    response = requests.get(
        "https://reversec.com/api/example",
        data=input_text
    )
    response.raise_for_status()
    return response.text

except requests.exceptions.RequestException as e:
    # e.response is None when no response was received (e.g. a connection error)
    if e.response is not None and e.response.status_code == 400:  # Guardrail Triggered - HTTP Status code will vary by provider/application
        raise GuardrailTrigger("Jailbreak Guardrail Detection", categories={"jailbreak": True})

    else:
        print(f"Error during HTTP request: {e}")
        raise
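
The rate-limiting case can be handled in the same spirit. The sketch below translates an HTTP status code into the matching exception. So that it runs without Spikee installed, it defines local stand-ins for GuardrailTrigger and RetryableError that mirror the constructor arguments described above; a real target would import both from spikee.tester. The raise_for_target_status helper, the choice of 400 as the guardrail signal, and the use of the Retry-After header are assumptions for illustration.

```python
from typing import Any, Dict, Mapping, Optional


class GuardrailTrigger(Exception):
    """Local stand-in for spikee.tester.GuardrailTrigger."""

    def __init__(self, message: str, categories: Optional[Dict[str, Any]] = None):
        super().__init__(message)
        self.categories = categories or {}


class RetryableError(Exception):
    """Local stand-in for spikee.tester.RetryableError."""

    def __init__(self, message: str, retry_period: int = 60):
        super().__init__(message)
        self.retry_period = retry_period


def raise_for_target_status(status_code: int, headers: Mapping[str, str]) -> None:
    """Translate an HTTP status into the matching exception; return None on success."""
    if status_code == 400:  # assumed guardrail signal; varies by application
        raise GuardrailTrigger("Guardrail blocked the payload", categories={"blocked": True})
    if status_code == 429:  # rate limited: honour Retry-After when it is a number of seconds
        retry_after = headers.get("Retry-After", "")
        period = int(retry_after) if retry_after.isdigit() else 60
        raise RetryableError("Target is rate limiting requests", retry_period=period)
    if status_code >= 400:
        raise RuntimeError(f"Unexpected HTTP status {status_code}")
```

Inside process_input, such a helper would be called with the response's status code and headers before the response text is returned.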

Multi-Turn Dynamic Targets

Spikee supports multi-turn attacks, using compatible attack and target scripts, which enable Spikee to assess conversational LLM applications against multi-turn prompt injection attacks.

To support this, Spikee provides the MultiTarget and SimpleMultiTarget parent classes, which include built-in support for managing conversation IDs and history storage in a multiprocessing-safe way.

Common Concepts Explained:

  • Spikee Session ID: This is a UUID generated by a Spikee attack script to uniquely identify a multi-turn attack entry.
  • Target/Application Session ID: This is the identifier used by the target application to track the chat session. Its format (e.g., UUID, integer, string) is determined by the specific target implementation.
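
The relationship between the two IDs can be pictured as a lookup table. The sketch below uses a plain in-memory dictionary, with create_app_session as a hypothetical stand-in for whatever call opens a chat session in the target application. It is not multiprocessing-safe and is not how Spikee stores the mapping; it only illustrates the idea.

```python
import uuid
from typing import Dict

# Maps Spikee session IDs to application session IDs (in-memory stand-in).
_session_map: Dict[str, str] = {}


def create_app_session() -> str:
    """Hypothetical stand-in for asking the application to open a new chat session."""
    return "app-" + uuid.uuid4().hex[:8]


def app_session_for(spikee_session_id: str) -> str:
    """Return the application session for a Spikee session, creating it on first use."""
    if spikee_session_id not in _session_map:
        _session_map[spikee_session_id] = create_app_session()
    return _session_map[spikee_session_id]
```

Every turn of the same multi-turn attack entry then resolves to the same application session, while a new entry gets a fresh one.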

MultiTarget Functions Explained:

  • _get_target_data(identifier): Retrieves stored data for a given ID.
  • _update_target_data(identifier, data): Updates stored data for a given ID.

(NB: Call _update_target_data after modifying any retrieved data so that the changes are saved.)
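
Why the write-back matters can be shown with a toy store that hands out copies of its data. Whether Spikee's storage behaves exactly this way is an assumption; CopyOnReadStore is a hypothetical class written for this example, and it only demonstrates the read, modify, update pattern.

```python
import copy
from typing import Any, Dict, Optional


class CopyOnReadStore:
    """Hypothetical store that hands out copies, so edits must be written back."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def get(self, identifier: str) -> Optional[Any]:
        # Return a deep copy: mutating the result does not touch the stored value.
        return copy.deepcopy(self._data.get(identifier))

    def update(self, identifier: str, data: Any) -> None:
        self._data[identifier] = copy.deepcopy(data)


store = CopyOnReadStore()
store.update("session-1", {"history": []})

data = store.get("session-1")
data["history"].append({"role": "user", "content": "Hello!"})
lost = store.get("session-1")   # still empty: the edit was never written back

store.update("session-1", data)
saved = store.get("session-1")  # now contains the appended message
```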

SimpleMultiTarget Functions Explained:

SimpleMultiTarget builds on MultiTarget by providing simplified conversation management and ID mapping for common use cases.

  • _get_conversation_data(session_id): Retrieves the conversation data for a given session ID.
  • _update_conversation_data(session_id, conversation_data): Updates the conversation data for a given session ID.
  • _append_conversation_data(session_id, role, content): Appends a message to the conversation data for a given session ID.
  • _get_id_map(spikee_session_id): Obtains the mapping of Spikee session IDs to target session IDs.
  • _update_id_map(spikee_session_id, associated_ids): Updates the mapping of Spikee session IDs to target session IDs.

Backtracking Support

Backtracking refers to the ability to "undo" the last turn in a conversation. This is crucial for certain multi-turn attacks (e.g., Crescendo) which rely on removing failed attempts (refusals) from the conversation history to prevent the LLM from entering a defensive state.

Configuration: To assert that your target supports backtracking, set backtrack=True in the __init__ method:

super().__init__(turn_types=[Turn.MULTI], backtrack=True)

Implementation: When process_input is called with backtrack=True, your target must remove the last pair of user and assistant messages from its stored history before processing the new input_text.

Fallback Behavior: If a target does not support backtracking (backtrack=False), attacks like Crescendo will automatically abort the current attempt upon refusal and restart with a fresh session ID, ensuring a clean state for the next attempt.
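
On a plain list of messages, removing the last turn amounts to dropping the final user/assistant pair. The following is a minimal sketch of that step; remove_last_turn is a hypothetical helper, and it assumes the history alternates user and assistant messages.

```python
from typing import Dict, List

Message = Dict[str, str]


def remove_last_turn(history: List[Message]) -> List[Message]:
    """Return a copy of history without the final user/assistant pair, if present."""
    if (
        len(history) >= 2
        and history[-2]["role"] == "user"
        and history[-1]["role"] == "assistant"
    ):
        return history[:-2]
    return list(history)  # nothing to undo


history = [
    {"role": "user", "content": "Tell me about locks."},
    {"role": "assistant", "content": "Locks secure doors."},
    {"role": "user", "content": "How do I pick one?"},
    {"role": "assistant", "content": "I can't help with that."},
]
trimmed = remove_last_turn(history)
```

After the call, trimmed holds only the first exchange, so the refusal no longer appears in the context sent with the next prompt. If the application keeps its own server-side history, that copy also has to be rolled back by whatever means the application offers.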

Example Function Implementations

# Generic Data Operations
# This example shows the target data dict being used to store session data.

target_data = self._get_target_data(spikee_session_id)
if target_data is None:  # `None` is returned for new IDs
    target_data = {"history": []}

target_data["history"].append({"role": "user", "content": "How does Spikee work?"})

self._update_target_data(spikee_session_id, target_data) # Call update after modifying data to save the changes.


# Simplified Conversation Operations
# This example shows how to use the simplified conversation functions to manage chat history.
session_id = uuid.uuid4()
self._update_conversation_data(session_id, [{"role": "user", "content": "Hello!"}]) # Create a new conversation for the session ID
self._append_conversation_data(session_id, "assistant", "Hi there! How can I help you today?") # Append a message to the conversation
self._get_conversation_data(session_id)  # Retrieve the conversation data for the session ID


# Simplified ID Mapping Operations
spikee_session_id = uuid.uuid4()
self._update_id_map(spikee_session_id, ["application-id-20"])  # Map Spikee session ID to target/application session ID
associated_ids = self._get_id_map(spikee_session_id)  # Retrieve the mapped target/application session IDs

See workspace/targets/test_chatbot.py for an example implementation of a MultiTarget target that manually manages session state and history. See workspace/targets/simple_test_chatbot.py for an example implementation of a SimpleMultiTarget that simplifies this process.

Multi-Turn Target Template

import uuid
from typing import List, Tuple, Union, Any, Optional

from spikee.templates.multi_target import MultiTarget
from spikee.utilities.enums import Turn, ModuleTag
from spikee.utilities.hinting import Content, TargetResponseHint, ModuleDescriptionHint, ModuleOptionsHint


class SampleMultiTurnTarget(MultiTarget):
    def __init__(self):
        super().__init__(
            # Specify that this target supports both single-turn and multi-turn interactions (Target Default is SINGLE only, MultiTarget default is MULTI only)
            turn_types=[Turn.SINGLE, Turn.MULTI],  

            # Does the target + target application support backtracking (e.g., editing previous messages in the conversation)
            backtrack=True 
        )

    def get_description(self) -> ModuleDescriptionHint:
        return [ModuleTag.MULTI], "Example Multi-Turn Target Template"

    def get_available_option_values(self) -> ModuleOptionsHint:
        """Return supported target options; Tuple[options (default is first), llm_required]"""
        return [], False

    def process_input(
        self,
        input_text: Content,
        system_message: Optional[Content] = None,
        target_options: Optional[str] = None,
        spikee_session_id: Optional[str] = None,
        backtrack: Optional[bool] = False,
    ) -> TargetResponseHint:
        # Handle single-turn interactions, assign a random UUIDv4
        if spikee_session_id is None:
            spikee_session_id = "single_turn_" + str(uuid.uuid4()) 

        # Get stored data. `None` will be returned for new IDs.
        target_data = self._get_target_data(spikee_session_id)

        # Create new conversation history
        if target_data is None:
            target_data = {'history': []}

        # Backtracking Logic
        if backtrack and len(target_data['history']) >= 2:
            # Remove last turn (one user message and one assistant message)
            target_data['history'] = target_data['history'][:-2]

            # ... Add any application-specific backtracking logic here ...
        
        # Query target application
        response = "... SEND PROMPT TO TARGET APPLICATION ..."

        # Add new messages to conversation history
        target_data['history'].append({"role": "user", "content": input_text})
        target_data['history'].append({"role": "assistant", "content": response})

        # Update stored data
        # Please ensure that you call `_update_target_data` after modifying any retrieved data to ensure changes are saved.
        self._update_target_data(spikee_session_id, target_data)

        return response

Simplified Multi-Turn Target Template

import uuid
from typing import List, Tuple, Union, Any, Optional

from spikee.templates.simple_multi_target import SimpleMultiTarget
from spikee.utilities.enums import Turn, ModuleTag
from spikee.utilities.hinting import Content, TargetResponseHint, ModuleDescriptionHint, ModuleOptionsHint


class SampleSimpleMultiTurnTarget(SimpleMultiTarget):
    def __init__(self):
        super().__init__(
            turn_types=[Turn.SINGLE, Turn.MULTI],  
            backtrack=True 
        )

    def get_description(self) -> ModuleDescriptionHint:
        return [ModuleTag.MULTI], "Example Simple Multi-Turn Target Template"

    def get_available_option_values(self) -> ModuleOptionsHint:
        """Return supported target options; Tuple[options (default is first), llm_required]"""
        return [], False

    def process_input(
        self,
        input_text: Content,
        system_message: Optional[Content] = None,
        target_options: Optional[str] = None,
        spikee_session_id: Optional[str] = None,
        backtrack: Optional[bool] = False,
    ) -> TargetResponseHint:
        target_session_id = None
        if spikee_session_id is None:
            # Handle single-turn interactions, assign a random UUIDv4
            target_session_id = "single_turn_" + str(uuid.uuid4()) 
        
        else:
            # Get mapped target session ID, for multi-turn interactions.
            target_session_id = self._get_id_map(spikee_session_id)

            # If no mapping exists, obtain a new ID and store it with `_update_id_map`
            # so that later turns reuse the same target session.
            # Implementation will vary for target application.
            if target_session_id is None:
                target_session_id = " ... IMPLEMENTATION-SPECIFIC SESSION ID ... "
        
        # Backtracking Logic
        if backtrack and spikee_session_id is not None:
            history = self._get_conversation_data(spikee_session_id)

            if history is not None and len(history) >= 2:
                # Remove last turn (one user message and one assistant message)
                history = history[:-2]

                # ... Add any application-specific backtracking logic here ...

                self._update_conversation_data(spikee_session_id, history)

        # Query target application
        response = "... SEND PROMPT TO TARGET APPLICATION ..."

        if spikee_session_id is not None:
            self._append_conversation_data(spikee_session_id, role="user", content=input_text)
            self._append_conversation_data(spikee_session_id, role="assistant", content=response)

        return response