New release: llm comparator and confidence evaluation #78

Merged
sromoam merged 11 commits into main from dev
Feb 17, 2026

@ayushi1208 (Contributor) commented Feb 12, 2026

Issue #, if available:

Description of changes:
1. Confidence Scoring with AUROC
  • Added support for confidence-based evaluation metrics using Area Under the Receiver Operating Characteristic (AUROC) curve
  • Extended JSON schema to accept: {'value': x, 'confidence': x}
  • AUROC calculation triggered via compare_with() function flag
2. LLM Comparator
  • New LLM-based comparator implementation using strands
  • Includes comprehensive pytest coverage
  • Mocked external dependencies (strands-agents, botocore) for reliable testing
3. Bulk Evaluator Aggregations
  • Enhanced bulk evaluation capabilities with aggregation support
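
The description does not show how the AUROC is computed, so the following is only a minimal sketch of the idea behind item 1. It assumes each predicted field counts as correct when its value equals the ground truth, and that the reported confidence is used as the ranking score. The names `auroc`, `predictions`, and `ground_truth`, and the sample data, are hypothetical and are not taken from the library.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random correct prediction has a
    higher confidence than a random incorrect one (ties count as 0.5)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined unless both classes are present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictions in the {'value': x, 'confidence': x} shape.
predictions = {
    "invoice_id": {"value": "INV-001", "confidence": 0.95},
    "total": {"value": "100.00", "confidence": 0.40},
    "vendor": {"value": "Acme", "confidence": 0.80},
    "date": {"value": "2026-01-01", "confidence": 0.30},
}
ground_truth = {
    "invoice_id": "INV-001",
    "total": "105.00",
    "vendor": "Acme",
    "date": "2026-01-02",
}

# Label 1 when the predicted value matches the ground truth, else 0.
labels = [int(predictions[k]["value"] == ground_truth[k]) for k in ground_truth]
scores = [predictions[k]["confidence"] for k in ground_truth]
confidence_auroc = auroc(labels, scores)
```

In this toy data every correct field has a higher confidence than every incorrect one, so the confidence scores separate the two groups perfectly. A value near 0.5 would instead suggest the confidences carry little information about correctness.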

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@ayushi1208 ayushi1208 closed this Feb 12, 2026
@ayushi1208 ayushi1208 reopened this Feb 12, 2026
ayushi1208 and others added 9 commits February 12, 2026 18:45
Fixed Doc Guide by removing merge conflict notation.
Update LLM comparator to use Strands, add tests and example script
* Update: add 3 file(s), modify 3 file(s)

* Update: modify 1 file(s)

* Update: modify 1 file(s)

* Update: modify 162 file(s)

* Update: modify 24 file(s)
* Added a mock class for strands to mitigate import errors, and raised an import error when trying to use the LLM

* Updated jinja template package version
* Mocked strands agents in test for llm comparator

* Addressed PR comments to add a conftest and docstring at the file level
These commits enable the project to evaluate model confidence scores using AUROC metrics, with proper error handling, comprehensive tests, and clear documentation.
Co-authored-by: Aditya Addepalli <adiadd@amazon.com>
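
The commit above mentions a mock class that avoids import errors when strands is absent and raises an import error on use. The actual class is not shown here, so the following is a sketch of one common way to do this, not the project's implementation. The helper `optional_class` and its arguments are hypothetical.

```python
import importlib


def optional_class(module_name: str, class_name: str, install_hint: str):
    """Return the real class if the module is importable; otherwise return a
    placeholder whose constructor raises ImportError with an install hint."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except ImportError:

        class _Missing:
            def __init__(self, *args, **kwargs):
                raise ImportError(
                    f"{class_name} requires the optional dependency "
                    f"'{install_hint}', which is not installed"
                )

        _Missing.__name__ = class_name
        return _Missing


# Assumed usage (not executed here): the module and class names below are
# illustrative guesses at what the LLM comparator would import.
# Agent = optional_class("strands", "Agent", "strands-agents")

# Demonstration with a module name that should not exist anywhere.
Placeholder = optional_class("module_that_is_not_installed_xyz", "Agent", "some-package")
```

With this pattern, importing the package never fails because of the optional dependency. The error only surfaces when someone tries to construct the LLM-backed object.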
* Added an aggregate_from_comparisons() function that takes a list of compare_with() results for aggregations

* Added test cases for the aggregate and update functions in bulk evaluator

* Added documentation on the new evaluation aggregation function
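
The signature and result format of aggregate_from_comparisons() are not shown in this conversation. As a rough illustration of aggregating over several comparison results, the sketch below assumes each result is reduced to a dictionary of `tp`, `fp`, and `fn` counts. That shape, and the function name `aggregate_counts`, are hypothetical stand-ins rather than the library's API.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_counts(results: Iterable[Dict[str, int]]) -> Dict[str, float]:
    """Sum per-document confusion counts, then derive micro-averaged metrics."""
    totals = Counter()
    for result in results:
        for key in ("tp", "fp", "fn"):
            totals[key] += result.get(key, 0)

    tp, fp, fn = totals["tp"], totals["fp"], totals["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical per-document counts, one dict per comparison result.
per_document = [
    {"tp": 3, "fp": 1, "fn": 0},
    {"tp": 1, "fp": 1, "fn": 2},
]
summary = aggregate_counts(per_document)
```

Summing counts before computing ratios gives micro-averaged metrics, so documents with more fields weigh more. Averaging per-document ratios instead would be a different design choice.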
@github-actions

🔒 Security Scan Results

ASH Security Scan Report

  • Report generated: 2026-02-12T23:49:03+00:00
  • Time since scan: 2 minutes

Scan Metadata

  • Project: ASH
  • Scan executed: 2026-02-12T23:47:03+00:00
  • ASH version: 3.1.2

Summary

Scanner Results

The table below shows findings by scanner, with status based on severity thresholds and dependencies:

  • Severity levels:
    • Suppressed (S): Findings that have been explicitly suppressed and don't affect scanner status
    • Critical (C): Highest severity findings that require immediate attention
    • High (H): Serious findings that should be addressed soon
    • Medium (M): Moderate risk findings
    • Low (L): Lower risk findings
    • Info (I): Informational findings with minimal risk
  • Duration (Time): Time taken by the scanner to complete its execution
  • Actionable: Number of findings at or above the threshold severity level that require attention
  • Result:
    • PASSED = No findings at or above threshold
    • FAILED = Findings at or above threshold
    • MISSING = Required dependencies not available
    • SKIPPED = Scanner explicitly disabled
    • ERROR = Scanner execution error
  • Threshold: The minimum severity level that will cause a scanner to fail
    • Thresholds: ALL, LOW, MEDIUM, HIGH, CRITICAL
    • Source: Values in parentheses indicate where the threshold is set:
      • global (global_settings section in the ASH_CONFIG used)
      • config (scanner config section in the ASH_CONFIG used)
      • scanner (default configuration in the plugin, if explicitly set)
  • Statistics calculation:
    • All statistics are calculated from the final aggregated SARIF report
    • Suppressed findings are counted separately and do not contribute to actionable findings
    • Scanner status is determined by comparing actionable findings to the threshold
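| Scanner | Suppressed | Critical | High | Medium | Low | Info | Actionable | Result | Threshold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bandit | 0 | 0 | 0 | 0 | 2979 | 0 | 0 | PASSED | MEDIUM (global) |
| cdk-nag | 0 | 0 | 0 | 0 | 0 | 0 | 0 | PASSED | MEDIUM (global) |
| cfn-nag | 0 | 0 | 0 | 0 | 0 | 0 | 0 | MISSING | MEDIUM (global) |
| checkov | 0 | 1 | 0 | 0 | 0 | 0 | 1 | SKIPPED | MEDIUM (global) |
| detect-secrets | 0 | 0 | 0 | 0 | 0 | 0 | 0 | PASSED | MEDIUM (global) |
| grype | 0 | 0 | 0 | 0 | 0 | 0 | 0 | MISSING | MEDIUM (global) |
| npm-audit | 0 | 0 | 0 | 0 | 0 | 0 | 0 | PASSED | MEDIUM (global) |
| opengrep | 0 | 0 | 0 | 0 | 0 | 0 | 0 | MISSING | MEDIUM (global) |
| semgrep | 0 | 1 | 0 | 0 | 0 | 0 | 1 | FAILED | MEDIUM (global) |
| syft | 0 | 0 | 0 | 0 | 0 | 0 | 0 | MISSING | MEDIUM (global) |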
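
The legend says a scanner's status comes from comparing actionable findings to the threshold. The sketch below illustrates that stated rule only and is not ASH's code. The severity ordering, the treatment of `ALL`, and the function names are assumptions, and the MISSING, SKIPPED, and ERROR states are left out.

```python
from typing import Dict

# Assumed ordering from least to most severe.
SEVERITY_ORDER = ["INFO", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def actionable_count(counts: Dict[str, int], threshold: str) -> int:
    """Count unsuppressed findings at or above the threshold severity.
    The threshold "ALL" is assumed to count every severity level."""
    start = 0 if threshold == "ALL" else SEVERITY_ORDER.index(threshold)
    return sum(counts.get(level, 0) for level in SEVERITY_ORDER[start:])


def scanner_result(counts: Dict[str, int], threshold: str) -> str:
    """PASSED when nothing is at or above the threshold, otherwise FAILED."""
    return "FAILED" if actionable_count(counts, threshold) > 0 else "PASSED"


# Counts transcribed from two rows of the table as printed.
bandit = {"LOW": 2979}
semgrep = {"CRITICAL": 1}

bandit_status = scanner_result(bandit, "MEDIUM")
semgrep_status = scanner_result(semgrep, "MEDIUM")
```

Under these assumptions, bandit's 2979 low-severity findings sit below the MEDIUM threshold and do not fail the scan, while a single finding above the threshold fails semgrep.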

Top 2 Hotspots

Files with the highest number of security findings:

| Finding Count | File Location |
| --- | --- |
| 1 | .github/workflows/workflow.yml |
| 1 | src/stickler/reporting/html/interactive/main.js |

Detailed Findings

2 actionable findings

Finding 1: CKV2_GHA_1

  • Severity: HIGH
  • Scanner: checkov
  • Rule ID: CKV2_GHA_1
  • Location: .github/workflows/workflow.yml:11-12

Description:
Ensure top-level permissions are not set to write-all


Finding 2: javascript.browser.security.insecure-document-method.insecure-document-method

  • Severity: HIGH
  • Scanner: semgrep
  • Rule ID: javascript.browser.security.insecure-document-method.insecure-document-method
  • Location: src/stickler/reporting/html/interactive/main.js:407-414

Description:
User controlled data in methods like innerHTML, outerHTML or document.write is an anti-pattern that can lead to XSS vulnerabilities

Code Snippet:

pdfItem.innerHTML = `
                <div class="pdf-container">
                    <canvas id="pdf-canvas-${escapeHtml(docId)}" class="pdf-canvas"></canvas>
                    <div class="pdf-loading" id="pdf-loading-${escapeHtml(docId)}">Loading PDF...</div>
                    <div class="pdf-error" id="pdf-error-${escapeHtml(docId)}" style="display: none;">Error loading PDF</div>
                </div>
                <p><strong>${escapeHtml(docId)}</strong></p>
            `;

Report generated by Automated Security Helper (ASH) at 2026-02-12T23:49:04+00:00

@ayushi1208 ayushi1208 marked this pull request as ready for review February 12, 2026 23:50
@ayushi1208 ayushi1208 requested a review from vawsgit February 12, 2026 23:53
@sromoam (Contributor) commented Feb 16, 2026

@adiadd @vawsgit can we get this one merged in the next two days? This will block the aggregate features in the accelerator.

* Updating the JS file to fix the HTML-based warnings from the ASH scan
* Fixing dead code throwing warnings in tests
* Registering slow tests with pyproj inclusion
@github-actions

🔒 Security Scan Results

ASH Security Scan Report

  • Report generated: 2026-02-16T21:58:02+00:00
  • Time since scan: 1 minute

Scan Metadata

  • Project: ASH
  • Scan executed: 2026-02-16T21:56:06+00:00
  • ASH version: 3.1.2

Summary

Scanner Results

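| Scanner | Suppressed | Critical | High | Medium | Low | Info | Actionable | Result | Threshold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bandit | 0 | 0 | 0 | 0 | 2979 | 0 | 0 | PASSED | MEDIUM (global) |
| cdk-nag | 0 | 0 | 0 | 0 | 0 | 0 | 0 | PASSED | MEDIUM (global) |
| cfn-nag | 0 | 0 | 0 | 0 | 0 | 0 | 0 | MISSING | MEDIUM (global) |
| checkov | 0 | 1 | 0 | 0 | 0 | 0 | 1 | SKIPPED | MEDIUM (global) |
| detect-secrets | 0 | 0 | 0 | 0 | 0 | 0 | 0 | PASSED | MEDIUM (global) |
| grype | 0 | 0 | 0 | 0 | 0 | 0 | 0 | MISSING | MEDIUM (global) |
| npm-audit | 0 | 0 | 0 | 0 | 0 | 0 | 0 | PASSED | MEDIUM (global) |
| opengrep | 0 | 0 | 0 | 0 | 0 | 0 | 0 | MISSING | MEDIUM (global) |
| semgrep | 0 | 0 | 0 | 0 | 0 | 0 | 0 | PASSED | MEDIUM (global) |
| syft | 0 | 0 | 0 | 0 | 0 | 0 | 0 | MISSING | MEDIUM (global) |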

Top 1 Hotspots

Files with the highest number of security findings:

| Finding Count | File Location |
| --- | --- |
| 1 | .github/workflows/workflow.yml |

Detailed Findings

1 actionable finding

Finding 1: CKV2_GHA_1

  • Severity: HIGH
  • Scanner: checkov
  • Rule ID: CKV2_GHA_1
  • Location: .github/workflows/workflow.yml:11-12

Description:
Ensure top-level permissions are not set to write-all


Report generated by Automated Security Helper (ASH) at 2026-02-16T21:58:03+00:00

@sromoam (Contributor) commented Feb 17, 2026

@vawsgit @adiadd can we tag this and bag this today?

Sync version across pyproject.toml and __init__.py (was 0.1.4 and
0.1.0 respectively) in preparation for v0.1.5 release.
@github-actions

🔒 Security Scan Results

ASH Security Scan Report

  • Report generated: 2026-02-17T14:54:42+00:00
  • Time since scan: 1 minute

Scan Metadata

  • Project: ASH
  • Scan executed: 2026-02-17T14:52:45+00:00
  • ASH version: 3.1.2

Summary

Scanner Results

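| Scanner | Suppressed | Critical | High | Medium | Low | Info | Actionable | Result | Threshold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bandit | 0 | 0 | 0 | 0 | 2979 | 0 | 0 | PASSED | MEDIUM (global) |
| cdk-nag | 0 | 0 | 0 | 0 | 0 | 0 | 0 | PASSED | MEDIUM (global) |
| cfn-nag | 0 | 0 | 0 | 0 | 0 | 0 | 0 | MISSING | MEDIUM (global) |
| checkov | 0 | 1 | 0 | 0 | 0 | 0 | 1 | SKIPPED | MEDIUM (global) |
| detect-secrets | 0 | 0 | 0 | 0 | 0 | 0 | 0 | PASSED | MEDIUM (global) |
| grype | 0 | 0 | 0 | 0 | 0 | 0 | 0 | MISSING | MEDIUM (global) |
| npm-audit | 0 | 0 | 0 | 0 | 0 | 0 | 0 | PASSED | MEDIUM (global) |
| opengrep | 0 | 0 | 0 | 0 | 0 | 0 | 0 | MISSING | MEDIUM (global) |
| semgrep | 0 | 0 | 0 | 0 | 0 | 0 | 0 | PASSED | MEDIUM (global) |
| syft | 0 | 0 | 0 | 0 | 0 | 0 | 0 | MISSING | MEDIUM (global) |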

Top 1 Hotspots

Files with the highest number of security findings:

| Finding Count | File Location |
| --- | --- |
| 1 | .github/workflows/workflow.yml |

Detailed Findings

1 actionable finding

Finding 1: CKV2_GHA_1

  • Severity: HIGH
  • Scanner: checkov
  • Rule ID: CKV2_GHA_1
  • Location: .github/workflows/workflow.yml:11-12

Description:
Ensure top-level permissions are not set to write-all


Report generated by Automated Security Helper (ASH) at 2026-02-17T14:54:43+00:00

@sromoam sromoam merged commit d4faff8 into main Feb 17, 2026
10 checks passed