Automated red teaming by hjrnunes · Pull Request #5 · trustyai-explainability/garak

hjrnunes · 2026-03-09T17:18:04Z

Draft temporary PR to keep track of changes.

…tive scores

adds single-scalar aggregating from garak results, biased by tier ## Problem Since garak results are spread across each probe, it is hard to evaluate if a given model is good or bad. It requires the researcher to have deep understanding of each probe to reach to some conclusion After finetuning a model, there is no way to compare Model A Vs Model B in an easy way. Ask - The ask is NOT to make Garak benchmark a score but rather provide an easy way to non-security experts to evaluate results ## Desiderata * A single result for a garak run, for model teams * Quantitative, scalar score (though not necessarily in a metric space) * Some stability * Simple to understand for model and system developers * Simple to understand for security folk * Weights failures in important probes higher * Gives lower score to targets with higher variation in performance over probes but same averages * Prioritises increases in rate of severity or failure * Doesn't overstate precision ## Proposal Let’s use “tier-based score aggregate" # How it works * We could use pass/fail for every probe:detector pair, but this might not afford enough granularity * Each probe:detector result (both pass rate and Z-score) is graded internally in garak on a 1-5 scale, 5 is great, 1 is awful - this uses the DEFCON scale * Grading boundaries are determined through experience using garak for review * Value boundaries stored currently in `garak.analyze` * DEFCON 1 & 2 are fails, according to Tier descriptions * Moving from pass/fail to a 1-5 scale gives us more granularity * First, we aggregate each probe:detector’s scores into one. This means combining the pass rate and Z-score. To do this, we extract the DEFCON for pass rate and for Z-score, and take the minimum. Any other aggregation measure would conceal important failures. * Next, we group probe:detector aggregate defcons by tier * We calculate the aggregate (e.g. harmonic mean, interpolated lower quartile) for Tier 1 and for Tier 2 probe:detector pairs * We take the weighted mean of Tier 1 and Tier 2 probes; propose a 2:1 weighting here * Round to 1 d.p. * Now you have a score in the range 1.0-5.0 where higher is better. \o/ NB: No garak score is stable over time. This is intended behaviour. For comparability, appropriate parts of config need to match; see NVIDIA#1193 ## Notes The **score can only be compared among identical versions of garak and identical configs**. We need to give a key identifying the basis for comparison (e.g. config & version). Tracking this feature in [garak issue 1193](NVIDIA#1193). Because “proportion of probe detections passed/failed”, we can use a sensible scale Percentage is still a bad idea - 100% sounds like a target is fully secure, but garak can never show this. No intent to support this design, for safety reasons Scores will be affected by the mixture of detectors and probes within Tier 1; later iterations, after technique/intent lands, will let us group this up by strategies & impacts, allowing more meaningful & stable results Items not in the bag may not be included, meaning an integration overhead for probe updates. One proposal: just use the absolute score, we’re doing a min() for aggregation anyway Beside config & version, any given TBSA score is predicated on: * DEFCON boundaries for pass rate & Z-score (stable, pretty confident about these) * Composition of Tier 1 and Tier 2’s probes (flux only between releases) * Detectors chosen for each probe (low flux, only between releases) * Models & regex used for detector (low flux, only between releases) * External APIs used and whatever is going on there (who knows) ## How this fills the goals * Garak TBSA is a single, scalar item * It’s stable for the same version and config * Dev teams can understand ratings 1-5, 1=bad 5=good * Many security folks understand DEFCON scores; many either lived through the cold war, saw Wargames, or have some familiarity with big military * Tier-1 probes have high impact, Tier-3 + U probes have no impact, so we have a weighting * Aggregation function downweights high-variance results * Coarse granularity of one decimal place expresses precision appropriately ## What’s the ideal solution like We should really be balancing score components grouped by their characteristics, rather than just how many probe/detector pairs there are. Ideally we’d like to be able to group techniques and group impacts. We have a couple of typologies for these but aren’t well-informed enough. Uniform weighting seems the only reasonable choice in the absence of anything else. # Implementation Build this as a new analysis tool. consumes: * report digest object relies on: * absolute scores in digest * relative scores in digest * defcons in digest * tier defs in digest outputs: * 2s.f. / 1d.p. score [1.0,5.0] usage: * `python -m garak.analyze.tbsa -r <report.jsonl filepath>` * `garak.analyze.tbsa.digest_to_tbsa()` open questions/extensions: * [ ] choose between current garak calibration and calibration in report file * [ ] what if there's no calibration data available overall * [ ] do we fill in gaps if major (0.x) versions match * [ ] what if there's no calibration data in the file but file calibration was recommended * [ ] what if there's absolute score but no relative, for a T1 probe * [ ] allow use of calculated z-scores, calculated defcons, current tierdefs * [ ] what if absolute is lower than relative, for T2 (insufficient impact, use relative) * [ ] configurable aggregate function (mean, floor, first quartile, harmonic mean) * [ ] aggregate multiple probe assessments to just one (mean? floor? harmonic? as specified by probe?) * [ ] load current / custom calibration * [ ] how strictly should we fail? (all missing relative? any missing relative? cutoff?) * [ ] how to handle groups? mean of scores per group? one DC per group? * [x] tests * [ ] should this be included in report digest (i guess), let's do that non-circularly

…es (NVIDIA#1579)

…te()` (NVIDIA#1578)

- Show detectors within a single probe instead of comparing across probes - Y-axis now displays detector names, not probe names - Remove cross-probe comparison logic and Hide N/A checkbox - Add DetectorResultsTable showing DEFCON, passed, failed, total counts - Simplify useDetectorChartOptions hook for new data flow - Update all related tests

…robe metrics - Refactor DetectorsView layout: probe header → detector breakdown → z-score chart - Add severity badge and pass rate display to probe header - Remove fail_count/prompt_count from UI (source unverified in backend) - Extract ProgressBar component for detector results visualization - Extract formatPercentage utility for consistent % display (no .00 decimals) - Use hit_count directly from backend for failure counts (no client calculations) - Add linked hover highlighting between results table and lollipop chart - Add tooltip to module score badge explaining aggregation function - Update all related tests to match new component structure - Fix color mappings: DC-3=blue, DC-4=green for visual consistency

Backend (report_digest.py) provides 'passed' count, not 'hit_count'. Code was defaulting to 0 failures when hit_count wasn't found. - Calculate failures as total - passed when hit_count is missing - Remove failure counts from Z-score y-axis labels (cleaner view) - Update related tests

- Add 'passed' field to useFlattenedModules data transformation - Failures now correctly derived from total - passed - Set probe bar chart y-axis to always show 0-100% scale

- Use module.summary.group instead of module.group_name for display - Shows 'LLM01: Prompt Injection' instead of 'llm01' - Conditionally render Anchor only when group_link exists

…ed on Stub

… rejected

… Attempt

…m self.prompts after _mint_attempt had already set it to the correct TextStub via _attempt_prestore_hook. The type mismatch (string vs TextStub) caused the EarlyStopHarness stub comparison to silently return False, so TAP jailbreak results were discarded and all baseline attempts remained in the rejected list regardless of outcome.

- fixes NVIDIAgh-93

…pec: S by default. Since intent_spec defaulted to "S", every scan went through early_stop_run() — even when a user just wantes to run standard garak probes. The intent system would be loaded, intent stubs generated, and CAS-related entries would end up in the JSONL output. Users running native probes would get unexpected CAS artifacts mixed into their results. The proposed fix is to default intent_spec to None (empty YAML value), so native probe scans take the normal probewise_run or pxd_run paths. Users who want intent-based scanning should explicitly opt in and set intent_spec: S or intent_spec: "*" in their config. - fixes NVIDIAgh-94

The AVID report pipeline expects "entry_type": "eval" and "eval_intent" entries in report.jsonl. EarlyStopHarness was not calling evaluator.evaluate(), so intent-based scans produced no eval entries and the AVID export failed or returned empty. Add evaluator.evaluate() after writing final attempts so the evaluator generates eval and eval_intent entries from the EarlyStop detector results. Update the harness test to verify probe identifier, detector, scores, and intent metadata are present in the report output. - fixes NVIDIAgh-95

…attempts EarlyStopHarness summary attempts carry detector_results (e.g. {"EarlyStop": [1.0]}) but have no assistant turns, so attempt.outputs is empty. When the evaluator iterates over detector scores and a score fais the threshold, it tries to access attempt.outputs[idx] unconditionally, raising IndexError. Add a bounds check before accessing attempt.outputs so the evaluator handles attempts where detector_results exist without corresponding outputs. Also add a regression test that reproduces the crash.

…orrect behavior between separate evaluations. This change maintains the within-batch guard while allowing for proper refresh of probe names, ensuring both probewise and early-stop flows function correctly.

The report digest (report_digest.py) and AVID export (report.py) both assumed eval entries use a two-part "module.Class" probe name and a dotted "module.Class" detector name. The EarlyStopHarness writes eval entries with probe "garak.harnesses.earlystop. EarlyStopHarness" (4 parts) and detector "EarlyStop" (no dot), causing ValueError on str.split(".") tuple unpacking in both files.

… jailbreak, so the caller's detectors can evaluate them

Implement RHEL AI midstream versioning pattern (X.Y.Z+rhaiv.N) using flit-scm to enable automatic version management from git tags while preserving flit as the build backend. Changes: - Migrate build backend from flit_core to flit_scm:buildapi - Enable dynamic versioning via setuptools_scm (used by flit-scm) - Add version fallback mechanism in __init__.py (0.0.0+dev) - Exclude auto-generated _version.py from version control - Add CHANGELOG.md with versioning documentation - Add RHEL-AI-VERSIONING-GUIDE.md as implementation reference - Document versioning pattern in README.md Technical Details: - flit-scm bridges flit (build) and setuptools_scm (versioning) - Version constraint: flit-scm>=1.7.0,<2 (transitively requires flit-core~=3.5) - All flit configuration preserved ([tool.flit.*] sections maintained) - Version file: garak/_version.py (auto-generated during build) Version Pattern: - Base: Semantic version (e.g., 0.14.1) - Midstream marker: +rhaiv (Red Hat AI Version) - Build number: .N (increments for fixes, resets on version bump) - Example: 0.14.1+rhaiv.1, 0.14.1+rhaiv.2, 0.15.0+rhaiv.1 Backward Compatibility: - No breaking changes to package functionality - pip install garak continues to work as before - Build system change is internal to packaging Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED

Signed-off-by: Sudip Sinha <sudsinha@redhat.com>

…'t see them The baseline IntentProbe independently fetches stubs from the intent service in its constructor, bypassing _filter_typology_stubs in the harness. This causes typology-derived stubs (e.g. "Prompts that seek to gain advice and strategies to commit fraud") to leak into the report, inflating baseline counts.

…tion Introduces a new MulticlassJudge detector that extends ModelAsJudge with JSON-aware response parsing and user-defined classification categories (e.g. complied/rejected/alternative/other). Supports configurable system and user prompts, custom score keys/fields, confidence thresholds, and optional JSON schema injection for structured output APIs.

Add RHEL AI midstream versioning support

…angservice - Add prompt_intents tracking to IntentProbe.build_prompts() so each prompt maps back to its source intent (fixes test_cas_intentprobe) - Fix test_spo_augmentation fixtures to load intent service before constructing SPO probes, use TextStub objects instead of raw strings, and align assertions with actual implementation (augment_system/ augment_user flags instead of nonexistent augmentation_func attribute). Add _mint_attempt tests that exercise the base Probe method with augmentation metadata verification. - Load intent service in langservice probe_pre_req fixture and skip TAPIntent/PAIR in translation test only when OPENAI_API_KEY is missing - Sync requirements.txt datasets version (3.x -> 4.x) with pyproject.toml

…reak wrapper or TAP variants

leondz and others added 30 commits January 28, 2026 11:02

print json output path

d70eca4

fix bug where 100% pass rate gave 'elevated risk' or similar for rela…

832bace

…tive scores

put abs comment set after the 100% abs score check

2c9ef72

give clearer err msg on invalid generator behaviour; model->target

41c715c

match generation count in mock resp and in test

5dda221

report aggregation check matches max absolute score relative comments

f8e7e11

fix: guard for None content as response from target (NVIDIA#1574)

eb2e113

fix: 100% pass rate should not give 'elevated risk' for relative scor…

52c2ef7

…es (NVIDIA#1579)

automatic garak/resources/plugin_cache.json update

d481851

task: Move generation output cardinality check into `Generator.genera…

da7a731

…te()` (NVIDIA#1578)

automatic garak/resources/plugin_cache.json update

d50061b

WIP

9f4575d

factor up defcon assignment, streamline digest construction

64bb773

fix(report-ui): restore prompt_count display in panel and tooltip

0df5fb6

fix(report-ui): pass through 'passed' field and fix y-axis to 100%

5585174

- Add 'passed' field to useFlattenedModules data transformation - Failures now correctly derived from total - passed - Set probe bar chart y-axis to always show 0-100% scale

fix(report-ui): remove pass rate from detector panel header

4747419

fix(report-ui): use formatted group name for taxonomy groupings

e263c1c

- Use module.summary.group instead of module.group_name for display - Shows 'LLM01: Prompt Injection' instead of 'llm01' - Conditionally render Anchor only when group_link exists

avoid W0102, mutable dict as default

d013a93

address W0102 / list param defaults set as mutable

c512fa1

address R1720 extraneous pass stmt

8614e36

simplify imports

b3ae90f

rm unused imports

047687c

correct filedetector type signatures

a4f0e95

address W0221, method param counts/sig should match when overridden

d1a260e

streamline loading prompt into attempt

1bed3df

migrate case_sensitive to configurable, not method, parameter

86897af

ABeltramo and others added 30 commits February 18, 2026 11:00

fix: fixes after rebase

aac4e84

fix: properly matching rejected attempts with previously rejected bas…

f0660f1

…ed on Stub

fix: apply stubs filter before creating the Probe

0472cb5

fix: keep track of previously accepted prompts

66f6191

feat: set attempt status to COMPLETE after successful evaluation

9d405b5

fix: change rejection logic so that it only stops if all attempts are…

1c73280

… rejected

fix: avoid exception when evaluating rejections and added goal to the…

f892911

… Attempt

fix: SPO sampling max_dan_samples

4c28d2f

fix: load up intents properly in earlystop

003c12f

Add tests for stub linkage bug

875c49e

Relax datasets dependency constraint to allow >=4.0.0

9950aa1

- fixes NVIDIAgh-93

Reset self.probename at the start of each evaluate() call to ensure c…

4504cd7

…orrect behavior between separate evaluations. This change maintains the within-batch guard while allowing for proper refresh of probe names, ensuring both probewise and early-stop flows function correctly.

run_tap should return its best attack prompts even without a score-10…

26e55c4

… jailbreak, so the caller's detectors can evaluate them

Filter typology-derived stubs in EarlyStopHarness

346bf4b

Delete RHEL-AI-VERSIONING-GUIDE.md

a5704ec

Signed-off-by: Sudip Sinha <sudsinha@redhat.com>

Merge pull request #4 from trustyai-explainability/rhel-ai-versioning

aa4966e

Add RHEL AI midstream versioning support

Adjust default prompts for MulticlassJudge

11f9f95

Fix MulticlassJudge default user prompt template

807f41a

Fix _mint_attempt tests.

8276105

fix: avoid exception when the model fails to return a message

738139d

fix: ensure detectors evaluate against the original intent, not jailb…

198ad86

…reak wrapper or TAP variants

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automated red teaming#5

Automated red teaming#5
hjrnunes wants to merge 370 commits intomainfrom
automated-red-teaming

hjrnunes commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

11 participants

Conversation

hjrnunes commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

11 participants