Skip to content

Automated red teaming#5

Draft
hjrnunes wants to merge 370 commits intomainfrom
automated-red-teaming
Draft

Automated red teaming#5
hjrnunes wants to merge 370 commits intomainfrom
automated-red-teaming

Conversation

@hjrnunes
Copy link

@hjrnunes hjrnunes commented Mar 9, 2026

Draft temporary PR to keep track of changes.

leondz and others added 30 commits January 28, 2026 11:02
adds single-scalar aggregating from garak results, biased by tier


## Problem
Since garak results are spread across each probe, it is hard to evaluate
if a given model is good or bad. It requires the researcher to have deep
understanding of each probe to reach to some conclusion
After finetuning a model, there is no way to compare Model A Vs Model B
in an easy way.

Ask - The ask is NOT to make Garak benchmark a score but rather provide
an easy way to non-security experts to evaluate results

## Desiderata

* A single result for a garak run, for model teams
* Quantitative, scalar score (though not necessarily in a metric space)
* Some stability
* Simple to understand for model and system developers
* Simple to understand for security folk
* Weights failures in important probes higher
* Gives lower score to targets with higher variation in performance over
probes but same averages
* Prioritises increases in rate of severity or failure 
* Doesn't overstate precision

## Proposal

Let’s use “tier-based score aggregate"

# How it works

* We could use pass/fail for every probe:detector pair, but this might
not afford enough granularity
* Each probe:detector result (both pass rate and Z-score) is graded
internally in garak on a 1-5 scale, 5 is great, 1 is awful - this uses
the DEFCON scale
* Grading boundaries are determined through experience using garak for
review
  * Value boundaries stored currently in `garak.analyze`
  * DEFCON 1 & 2 are fails, according to Tier descriptions
* Moving from pass/fail to a 1-5 scale gives us more granularity
* First, we aggregate each probe:detector’s scores into one. This means
combining the pass rate and Z-score. To do this, we extract the DEFCON
for pass rate and for Z-score, and take the minimum. Any other
aggregation measure would conceal important failures.
* Next, we group probe:detector aggregate defcons by tier
* We calculate the aggregate (e.g. harmonic mean, interpolated lower
quartile) for Tier 1 and for Tier 2 probe:detector pairs
* We take the weighted mean of Tier 1 and Tier 2 probes; propose a 2:1
weighting here
* Round to 1 d.p.
* Now you have a score in the range 1.0-5.0 where higher is better. \o/

NB: No garak score is stable over time. This is intended behaviour. For
comparability, appropriate parts of config need to match; see NVIDIA#1193


## Notes
The **score can only be compared among identical versions of garak and
identical configs**. We need to give a key identifying the basis for
comparison (e.g. config & version). Tracking this feature in [garak
issue 1193](NVIDIA#1193).

Because “proportion of probe detections passed/failed”, we can use a
sensible scale

Percentage is still a bad idea - 100% sounds like a target is fully
secure, but garak can never show this. No intent to support this design,
for safety reasons

Scores will be affected by the mixture of detectors and probes within
Tier 1; later iterations, after technique/intent lands, will let us
group this up by strategies & impacts, allowing more meaningful & stable
results

Items not in the bag may not be included, meaning an integration
overhead for probe updates. One proposal: just use the absolute score,
we’re doing a min() for aggregation anyway

Beside config & version, any given TBSA score is predicated on:

* DEFCON boundaries for pass rate & Z-score (stable, pretty confident
about these)
* Composition of Tier 1 and Tier 2’s probes (flux only between releases)
* Detectors chosen for each probe (low flux, only between releases)
* Models & regex used for detector (low flux, only between releases)
* External APIs used and whatever is going on there (who knows)


## How this fills the goals

* Garak TBSA is a single, scalar item
* It’s stable for the same version and config
* Dev teams can understand ratings 1-5, 1=bad 5=good
* Many security folks understand DEFCON scores; many either lived
through the cold war, saw Wargames, or have some familiarity with big
military
* Tier-1 probes have high impact, Tier-3 + U probes have no impact, so
we have a weighting
* Aggregation function downweights high-variance results
* Coarse granularity of one decimal place expresses precision
appropriately


## What’s the ideal solution like

We should really be balancing score components grouped by their
characteristics, rather than just how many probe/detector pairs there
are. Ideally we’d like to be able to group techniques and group impacts.
We have a couple of typologies for these but aren’t well-informed
enough. Uniform weighting seems the only reasonable choice in the
absence of anything else.



# Implementation

Build this as a new analysis tool.

consumes:
* report digest object

relies on:
* absolute scores in digest
* relative scores in digest
* defcons in digest
* tier defs in digest

outputs:
* 2s.f. / 1d.p. score [1.0,5.0]

usage:
* `python -m garak.analyze.tbsa -r <report.jsonl filepath>`
* `garak.analyze.tbsa.digest_to_tbsa()`

open questions/extensions:
* [ ] choose between current garak calibration and calibration in report
file
* [ ] what if there's no calibration data available overall
* [ ] do we fill in gaps if major (0.x) versions match
* [ ] what if there's no calibration data in the file but file
calibration was recommended
* [ ] what if there's absolute score but no relative, for a T1 probe
* [ ] allow use of calculated z-scores, calculated defcons, current
tierdefs
* [ ] what if absolute is lower than relative, for T2 (insufficient
impact, use relative)
* [ ] configurable aggregate function (mean, floor, first quartile,
harmonic mean)
* [ ] aggregate multiple probe assessments to just one (mean? floor?
harmonic? as specified by probe?)
* [ ] load current / custom calibration
* [ ] how strictly should we fail? (all missing relative? any missing
relative? cutoff?)
* [ ] how to handle groups? mean of scores per group? one DC per group?
* [x] tests
* [ ] should this be included in report digest (i guess), let's do that
non-circularly
- Show detectors within a single probe instead of comparing across probes
- Y-axis now displays detector names, not probe names
- Remove cross-probe comparison logic and Hide N/A checkbox
- Add DetectorResultsTable showing DEFCON, passed, failed, total counts
- Simplify useDetectorChartOptions hook for new data flow
- Update all related tests
…robe metrics

- Refactor DetectorsView layout: probe header → detector breakdown → z-score chart
- Add severity badge and pass rate display to probe header
- Remove fail_count/prompt_count from UI (source unverified in backend)
- Extract ProgressBar component for detector results visualization
- Extract formatPercentage utility for consistent % display (no .00 decimals)
- Use hit_count directly from backend for failure counts (no client calculations)
- Add linked hover highlighting between results table and lollipop chart
- Add tooltip to module score badge explaining aggregation function
- Update all related tests to match new component structure
- Fix color mappings: DC-3=blue, DC-4=green for visual consistency
Backend (report_digest.py) provides 'passed' count, not 'hit_count'.
Code was defaulting to 0 failures when hit_count wasn't found.

- Calculate failures as total - passed when hit_count is missing
- Remove failure counts from Z-score y-axis labels (cleaner view)
- Update related tests
- Add 'passed' field to useFlattenedModules data transformation
- Failures now correctly derived from total - passed
- Set probe bar chart y-axis to always show 0-100% scale
- Use module.summary.group instead of module.group_name for display
- Shows 'LLM01: Prompt Injection' instead of 'llm01'
- Conditionally render Anchor only when group_link exists
ABeltramo and others added 30 commits February 18, 2026 11:00
…m self.prompts after _mint_attempt had already set it to the correct TextStub via _attempt_prestore_hook. The type mismatch (string vs TextStub) caused the EarlyStopHarness stub comparison to silently return False, so TAP jailbreak results were discarded and all baseline attempts remained in the rejected list regardless of outcome.
…pec: S by default. Since intent_spec defaulted to "S", every scan went through early_stop_run() — even when a user just wantes to run standard garak probes. The intent system would be loaded, intent stubs generated, and CAS-related entries would end up in the JSONL output. Users running native probes would get unexpected CAS artifacts mixed into their results.

The proposed fix is to default intent_spec to None (empty YAML value), so native probe scans take the normal probewise_run or pxd_run paths. Users who want intent-based scanning should explicitly opt in and set intent_spec: S or intent_spec: "*" in their config.

- fixes NVIDIAgh-94
The AVID report pipeline expects "entry_type": "eval" and "eval_intent" entries in report.jsonl. EarlyStopHarness was not calling evaluator.evaluate(), so intent-based scans produced no eval entries and the AVID export failed or returned empty.

Add evaluator.evaluate() after writing final attempts so the evaluator generates eval and eval_intent entries from the EarlyStop detector results. Update the harness test to verify probe identifier, detector, scores, and intent metadata are present in the report output.

- fixes NVIDIAgh-95
…attempts

EarlyStopHarness summary attempts carry detector_results (e.g. {"EarlyStop": [1.0]}) but have no assistant turns, so attempt.outputs is empty. When the evaluator iterates over detector scores and a score fais the threshold, it tries to access attempt.outputs[idx] unconditionally, raising IndexError.

Add a bounds check before accessing attempt.outputs so the evaluator handles attempts where detector_results exist without corresponding outputs. Also add a regression test that reproduces the crash.
…orrect behavior between separate evaluations. This change maintains the within-batch guard while allowing for proper refresh of probe names, ensuring both probewise and early-stop flows function correctly.
The report digest (report_digest.py) and AVID export (report.py)
both assumed eval entries use a two-part "module.Class" probe name
and a dotted "module.Class" detector name. The EarlyStopHarness
writes eval entries with probe "garak.harnesses.earlystop.
EarlyStopHarness" (4 parts) and detector "EarlyStop" (no dot),
causing ValueError on str.split(".") tuple unpacking in both files.
… jailbreak, so the caller's detectors can evaluate them
Implement RHEL AI midstream versioning pattern (X.Y.Z+rhaiv.N) using
flit-scm to enable automatic version management from git tags while
preserving flit as the build backend.

Changes:
- Migrate build backend from flit_core to flit_scm:buildapi
- Enable dynamic versioning via setuptools_scm (used by flit-scm)
- Add version fallback mechanism in __init__.py (0.0.0+dev)
- Exclude auto-generated _version.py from version control
- Add CHANGELOG.md with versioning documentation
- Add RHEL-AI-VERSIONING-GUIDE.md as implementation reference
- Document versioning pattern in README.md

Technical Details:
- flit-scm bridges flit (build) and setuptools_scm (versioning)
- Version constraint: flit-scm>=1.7.0,<2 (transitively requires flit-core~=3.5)
- All flit configuration preserved ([tool.flit.*] sections maintained)
- Version file: garak/_version.py (auto-generated during build)

Version Pattern:
- Base: Semantic version (e.g., 0.14.1)
- Midstream marker: +rhaiv (Red Hat AI Version)
- Build number: .N (increments for fixes, resets on version bump)
- Example: 0.14.1+rhaiv.1, 0.14.1+rhaiv.2, 0.15.0+rhaiv.1

Backward Compatibility:
- No breaking changes to package functionality
- pip install garak continues to work as before
- Build system change is internal to packaging

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sudip Sinha <sudsinha@redhat.com>
…'t see them

The baseline IntentProbe independently fetches stubs from the intent service in its constructor, bypassing _filter_typology_stubs in the harness. This causes typology-derived stubs (e.g. "Prompts that seek to gain advice and strategies to commit fraud") to leak into the report, inflating baseline counts.
…tion

Introduces a new MulticlassJudge detector that extends ModelAsJudge with JSON-aware response parsing and user-defined classification categories (e.g. complied/rejected/alternative/other). Supports configurable system and user prompts, custom score keys/fields, confidence thresholds, and optional JSON schema injection for structured output APIs.
…angservice

- Add prompt_intents tracking to IntentProbe.build_prompts() so each prompt maps back to its source intent (fixes test_cas_intentprobe)
- Fix test_spo_augmentation fixtures to load intent service before constructing SPO probes, use TextStub objects instead of raw strings, and align assertions with actual implementation (augment_system/ augment_user flags instead of nonexistent augmentation_func attribute). Add _mint_attempt tests that exercise the base Probe method with augmentation metadata verification.
- Load intent service in langservice probe_pre_req fixture and skip TAPIntent/PAIR in translation test only when OPENAI_API_KEY is missing
- Sync requirements.txt datasets version (3.x -> 4.x) with pyproject.toml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.