New release: llm comparator and confidence evaluation by ayushi1208 · Pull Request #78 · awslabs/stickler

ayushi1208 · 2026-02-12T20:06:29Z

Issue #, if available:

Description of changes:
1. Confidence Scoring with AUROC
   - Added support for confidence-based evaluation metrics using the Area Under the Receiver Operating Characteristic (AUROC) curve
   - Extended the JSON schema to accept: {'value': x, 'confidence': x}
   - AUROC calculation is triggered via a flag on the compare_with() function
2. LLM Comparator
   - New LLM-based comparator implementation using Strands
   - Includes comprehensive pytest coverage
   - Mocked external dependencies (strands-agents, botocore) for reliable testing
3. Bulk Evaluator Aggregations
   - Enhanced bulk evaluation capabilities with aggregation support
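
The confidence-scoring item above can be illustrated with a minimal sketch. It assumes each predicted field carries a confidence in [0, 1] and that comparing it against ground truth yields a correct/incorrect label. The function name `auroc` and the sample data are hypothetical and are not stickler's API. AUROC is computed here in its rank-based (Mann-Whitney) form: the probability that a randomly chosen correct prediction has higher confidence than a randomly chosen incorrect one, with ties counted as one half.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct prediction outranks an incorrect one (ties count 0.5)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("AUROC is undefined without both correct and incorrect predictions")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictions in the extended {'value': ..., 'confidence': ...} shape.
predictions = [
    {"value": "ACME Corp", "confidence": 0.9},
    {"value": "2024-01-05", "confidence": 0.8},
    {"value": "USD", "confidence": 0.3},
    {"value": "42.00", "confidence": 0.6},
]
ground_truth = ["ACME Corp", "2024-01-05", "EUR", "41.00"]

# A field counts as correct when its value matches the ground truth exactly (assumption).
labels = [p["value"] == g for p, g in zip(predictions, ground_truth)]
score = auroc([p["confidence"] for p in predictions], labels)
```

A score near 1.0 means the confidences separate correct from incorrect fields well, while 0.5 means they carry no ranking information. The pairwise loop is quadratic and only meant for illustration.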

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

* Added an aggregate_from_comparisons() function that takes a list of compare_with() results for aggregations
* Added test cases for the aggregate and update functions in bulk evaluator
* Added documentation on the new evaluation aggregation function
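
As a rough illustration of aggregating several per-document comparison results, the sketch below sums counts across documents and derives metrics from the totals. The result shape (`tp`, `fp`, `fn` keys) and the helper name `aggregate_counts` are assumptions made for this example. The actual return structure of compare_with() and the signature of aggregate_from_comparisons() are not shown in this pull request.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_counts(results: Iterable[Dict[str, int]]) -> Dict[str, float]:
    """Sum per-document counts, then derive precision, recall, and F1 from the totals."""
    totals = Counter()
    for r in results:
        # Missing keys are treated as zero so partial results still aggregate.
        totals.update({k: r.get(k, 0) for k in ("tp", "fp", "fn")})
    tp, fp, fn = totals["tp"], totals["fp"], totals["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical per-document results, one dict per comparison.
per_document = [
    {"tp": 8, "fp": 2, "fn": 0},
    {"tp": 2, "fp": 0, "fn": 4},
]
summary = aggregate_counts(per_document)
```

Summing counts before computing metrics (micro-averaging) weights every field equally. Averaging per-document metrics instead would weight every document equally. Which of the two the library uses is not stated here.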

github-actions · 2026-02-12T23:49:11Z

🔒 Security Scan Results

ASH Security Scan Report

Report generated: 2026-02-12T23:49:03+00:00
Time since scan: 2 minutes

Scan Metadata

Project: ASH
Scan executed: 2026-02-12T23:47:03+00:00
ASH version: 3.1.2

Summary

Scanner Results

The table below shows findings by scanner, with status based on severity thresholds and dependencies:

- Severity levels:
  - Suppressed (S): Findings that have been explicitly suppressed and don't affect scanner status
  - Critical (C): Highest severity findings that require immediate attention
  - High (H): Serious findings that should be addressed soon
  - Medium (M): Moderate risk findings
  - Low (L): Lower risk findings
  - Info (I): Informational findings with minimal risk
- Duration (Time): Time taken by the scanner to complete its execution
- Actionable: Number of findings at or above the threshold severity level that require attention
- Result:
  - PASSED = No findings at or above threshold
  - FAILED = Findings at or above threshold
  - MISSING = Required dependencies not available
  - SKIPPED = Scanner explicitly disabled
  - ERROR = Scanner execution error
- Threshold: The minimum severity level that will cause a scanner to fail
  - Thresholds: ALL, LOW, MEDIUM, HIGH, CRITICAL
  - Source: Values in parentheses indicate where the threshold is set:
    - global (global_settings section in the ASH_CONFIG used)
    - config (scanner config section in the ASH_CONFIG used)
    - scanner (default configuration in the plugin, if explicitly set)
- Statistics calculation:
  - All statistics are calculated from the final aggregated SARIF report
  - Suppressed findings are counted separately and do not contribute to actionable findings
  - Scanner status is determined by comparing actionable findings to the threshold
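
The status rules in this legend can be sketched as a small function. This illustrates the described logic only and is not ASH's implementation. The names `SEVERITY_ORDER`, `actionable_count`, and `scanner_status` are hypothetical, and treating `ALL` as "every severity counts" is an assumption.

```python
# Severities from lowest to highest; "ALL" as a threshold admits every level (assumption).
SEVERITY_ORDER = ["INFO", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def actionable_count(findings, threshold):
    """Count unsuppressed findings at or above the threshold severity."""
    floor = 0 if threshold == "ALL" else SEVERITY_ORDER.index(threshold)
    return sum(
        1
        for f in findings
        if not f.get("suppressed", False)
        and SEVERITY_ORDER.index(f["severity"]) >= floor
    )


def scanner_status(findings, threshold, *, enabled=True, dependencies_ok=True):
    """Map findings to SKIPPED / MISSING / FAILED / PASSED as the legend describes."""
    if not enabled:
        return "SKIPPED"
    if not dependencies_ok:
        return "MISSING"
    return "FAILED" if actionable_count(findings, threshold) > 0 else "PASSED"


# Hypothetical finding sets: three Low findings, then the same plus one High finding.
low_only = [{"severity": "LOW"}] * 3
one_high = low_only + [{"severity": "HIGH"}]
```

Under a MEDIUM threshold, a scanner that reports only Low findings still passes, which matches the bandit row in the results table. A single unsuppressed High finding is enough to fail.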

Scanner	Critical	Low	Actionable	Result	Threshold
bandit	0	2979	0	PASSED	MEDIUM (global)
cdk-nag	0	0	0	PASSED	MEDIUM (global)
cfn-nag	0	0	0	MISSING	MEDIUM (global)
checkov	1	0	1	SKIPPED	MEDIUM (global)
detect-secrets	0	0	0	PASSED	MEDIUM (global)
grype	0	0	0	MISSING	MEDIUM (global)
npm-audit	0	0	0	PASSED	MEDIUM (global)
opengrep	0	0	0	MISSING	MEDIUM (global)
semgrep	1	0	1	FAILED	MEDIUM (global)
syft	0	0	0	MISSING	MEDIUM (global)

Top 2 Hotspots

Files with the highest number of security findings:

Finding Count	File Location
1	.github/workflows/workflow.yml
1	src/stickler/reporting/html/interactive/main.js

Detailed Findings

Show 2 actionable findings

Finding 1: CKV2_GHA_1

Severity: HIGH
Scanner: checkov
Rule ID: CKV2_GHA_1
Location: .github/workflows/workflow.yml:11-12

Description:
Ensure top-level permissions are not set to write-all

Finding 2: javascript.browser.security.insecure-document-method.insecure-document-method

Severity: HIGH
Scanner: semgrep
Rule ID: javascript.browser.security.insecure-document-method.insecure-document-method
Location: src/stickler/reporting/html/interactive/main.js:407-414

Description:
User controlled data in methods like innerHTML, outerHTML or document.write is an anti-pattern that can lead to XSS vulnerabilities

Code Snippet:

pdfItem.innerHTML = `
                <div class="pdf-container">
                    <canvas id="pdf-canvas-${escapeHtml(docId)}" class="pdf-canvas"></canvas>
                    <div class="pdf-loading" id="pdf-loading-${escapeHtml(docId)}">Loading PDF...</div>
                    <div class="pdf-error" id="pdf-error-${escapeHtml(docId)}" style="display: none;">Error loading PDF</div>
                </div>
                <p><strong>${escapeHtml(docId)}</strong></p>
            `;

Report generated by Automated Security Helper (ASH) at 2026-02-12T23:49:04+00:00

sromoam · 2026-02-16T18:31:16Z

@adiadd @vawsgit Can we get this one merged in the next two days? This will block the aggregation features in the accelerator.

github-actions · 2026-02-16T21:58:10Z

🔒 Security Scan Results

ASH Security Scan Report

Report generated: 2026-02-16T21:58:02+00:00
Time since scan: 1 minute

Scan Metadata

Project: ASH
Scan executed: 2026-02-16T21:56:06+00:00
ASH version: 3.1.2

Summary

Scanner Results

The table below shows findings by scanner, with status based on severity thresholds and dependencies:

Scanner	Critical	Low	Actionable	Result	Threshold
bandit	0	2979	0	PASSED	MEDIUM (global)
cdk-nag	0	0	0	PASSED	MEDIUM (global)
cfn-nag	0	0	0	MISSING	MEDIUM (global)
checkov	1	0	1	SKIPPED	MEDIUM (global)
detect-secrets	0	0	0	PASSED	MEDIUM (global)
grype	0	0	0	MISSING	MEDIUM (global)
npm-audit	0	0	0	PASSED	MEDIUM (global)
opengrep	0	0	0	MISSING	MEDIUM (global)
semgrep	0	0	0	PASSED	MEDIUM (global)
syft	0	0	0	MISSING	MEDIUM (global)

Top 1 Hotspots

Files with the highest number of security findings:

Finding Count	File Location
1	.github/workflows/workflow.yml

Detailed Findings

Show 1 actionable findings

Finding 1: CKV2_GHA_1

Severity: HIGH
Scanner: checkov
Rule ID: CKV2_GHA_1
Location: .github/workflows/workflow.yml:11-12

Description:
Ensure top-level permissions are not set to write-all

Report generated by Automated Security Helper (ASH) at 2026-02-16T21:58:03+00:00

sromoam · 2026-02-17T14:32:57Z

@vawsgit @adiadd can we tag this and bag this today?

github-actions · 2026-02-17T14:54:52Z

🔒 Security Scan Results

ASH Security Scan Report

Report generated: 2026-02-17T14:54:42+00:00
Time since scan: 1 minute

Scan Metadata

Project: ASH
Scan executed: 2026-02-17T14:52:45+00:00
ASH version: 3.1.2

Summary

Scanner Results

The table below shows findings by scanner, with status based on severity thresholds and dependencies:

Scanner	Critical	Low	Actionable	Result	Threshold
bandit	0	2979	0	PASSED	MEDIUM (global)
cdk-nag	0	0	0	PASSED	MEDIUM (global)
cfn-nag	0	0	0	MISSING	MEDIUM (global)
checkov	1	0	1	SKIPPED	MEDIUM (global)
detect-secrets	0	0	0	PASSED	MEDIUM (global)
grype	0	0	0	MISSING	MEDIUM (global)
npm-audit	0	0	0	PASSED	MEDIUM (global)
opengrep	0	0	0	MISSING	MEDIUM (global)
semgrep	0	0	0	PASSED	MEDIUM (global)
syft	0	0	0	MISSING	MEDIUM (global)

Top 1 Hotspots

Files with the highest number of security findings:

Finding Count	File Location
1	.github/workflows/workflow.yml

Detailed Findings

Show 1 actionable findings

Finding 1: CKV2_GHA_1

Severity: HIGH
Scanner: checkov
Rule ID: CKV2_GHA_1
Location: .github/workflows/workflow.yml:11-12

Description:
Ensure top-level permissions are not set to write-all

Report generated by Automated Security Helper (ASH) at 2026-02-17T14:54:43+00:00

ayushi1208 closed this Feb 12, 2026

ayushi1208 reopened this Feb 12, 2026

ayushi1208 and others added 9 commits February 12, 2026 18:45

Fixed Doc Guide (#53)

952fb85

Fixed Doc Guide removing merge conflict notation.

Fixed LLM comparator and added pytests for it (#15)

e93d975

Update LLM comparator to use Strands, add tests and example script

🔧 Fix All Ruff Linting Errors (59 Errors Fixed) (#61)

aba4ff6

* Update: add 3 file(s), modify 3 file(s)
* Update: modify 1 file(s)
* Update: modify 1 file(s)
* Update: modify 162 file(s)
* Update: modify 24 file(s)

Strands and Jinja2 package updates (#60)

af2fa1d

* Added a mock class for strands to mitigate import errors; it raises an ImportError when the LLM comparator is used without strands installed
* Updated jinja template package version
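
One common way to get the behavior this commit describes (the package imports cleanly without the optional dependency, and fails only when the LLM feature is used) is a placeholder class. The sketch below shows that general pattern under stated assumptions and is not the code from this commit. The helper name `load_or_placeholder` is hypothetical, and `strands.Agent` with the `strands-agents` install name is assumed to be the dependency's entry point.

```python
import importlib


def load_or_placeholder(module_name: str, attr: str, install_hint: str):
    """Return module_name.attr, or a placeholder class that raises ImportError on use."""
    try:
        return getattr(importlib.import_module(module_name), attr)
    except ImportError:

        class _Missing:
            def __init__(self, *args, **kwargs):
                raise ImportError(
                    f"{attr} requires the optional package '{module_name}'. {install_hint}"
                )

        _Missing.__name__ = attr
        return _Missing


# Importing this module never fails; only constructing the class does
# when the optional dependency is absent.
Agent = load_or_placeholder("strands", "Agent", "Install it with: pip install strands-agents")
```

Deferring the error to construction time keeps the rest of the library importable for users who never touch the LLM comparator.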

Updated for AGENTS.md (#57)

2254969

Mocked strands-agents+botocore in test for llm comparator (#67)

91e58e0

* Mocked strands agents in test for llm comparator
* Addressed PR comments to add a conftest and docstring at the file level
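
A typical way to mock an external module in tests is to place a `MagicMock` in `sys.modules` for the duration of the test. The sketch below assumes the code under test imports `strands` lazily and calls an agent with a prompt. `call_llm` is a hypothetical stand-in for the comparator, and the repository's actual conftest may be organized differently.

```python
import importlib
import sys
from unittest.mock import MagicMock, patch


def call_llm(prompt: str) -> str:
    """Hypothetical code under test: imports the dependency lazily, then calls it."""
    strands = importlib.import_module("strands")
    agent = strands.Agent()
    return str(agent(prompt))


def test_call_llm_with_mocked_strands():
    fake_strands = MagicMock()
    # Agent() returns a mock agent; calling that agent returns a canned answer.
    fake_strands.Agent.return_value.return_value = "EQUIVALENT"
    # patch.dict restores sys.modules on exit, so other tests are unaffected.
    with patch.dict(sys.modules, {"strands": fake_strands}):
        assert call_llm("Are 'NYC' and 'New York City' the same?") == "EQUIVALENT"
    fake_strands.Agent.return_value.assert_called_once()
```

Because no network call or model invocation happens, the test is deterministic and runs without credentials. Moving the patch into a pytest fixture in conftest.py would share it across test files.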

Feat/confidence (#54)

57e6469

These commits enable the project to evaluate model confidence scores using AUROC metrics, with proper error handling, comprehensive tests, and clear documentation.

feat: adding stickler deepwiki badge (#73)

e6d9f77

Co-authored-by: Aditya Addepalli <adiadd@amazon.com>

ayushi1208 force-pushed the dev branch from ec52553 to 2933418 February 12, 2026 23:45

ayushi1208 marked this pull request as ready for review February 12, 2026 23:50

ayushi1208 requested a review from vawsgit February 12, 2026 23:53

Fixing the ASH findings, and Test Warnings (#80)

f9f1ff5

* Updating the JS file to fix the HTML-based warnings from the ASH scan
* Fixing dead code throwing warnings in tests.
* Registering slow tests with pyproj inclusion.

adiadd approved these changes Feb 17, 2026

chore: bump version to 0.1.5 for release

ed1fcfd

Sync version across pyproject.toml and __init__.py (was 0.1.4 and 0.1.0 respectively) in preparation for v0.1.5 release.
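
A drift like the one fixed here (0.1.4 in one file, 0.1.0 in the other) can be caught by a small consistency check. The sketch below is illustrative only. It assumes the version appears as `version = "..."` in pyproject.toml and as `__version__ = "..."` in `__init__.py`, and it uses inline sample text rather than the repository's real files.

```python
import re

# Assumed layouts of the two version declarations.
PYPROJECT_PATTERN = r'^version\s*=\s*"([^"]+)"'
INIT_PATTERN = r'^__version__\s*=\s*"([^"]+)"'


def read_version(text: str, pattern: str) -> str:
    """Extract the first version string matching the pattern, or raise ValueError."""
    match = re.search(pattern, text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no version string found")
    return match.group(1)


# Sample file contents for illustration; a real check would read the files from disk.
pyproject_text = '[project]\nname = "example-package"\nversion = "0.1.5"\n'
init_text = '"""Package docstring."""\n__version__ = "0.1.5"\n'

versions_match = (
    read_version(pyproject_text, PYPROJECT_PATTERN)
    == read_version(init_text, INIT_PATTERN)
)
```

Running such a check in CI or as a unit test would flag a mismatch before a release is tagged.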

sromoam merged commit d4faff8 into main Feb 17, 2026
10 checks passed

sromoam merged 11 commits into main from dev

4 participants
