Fixed the Doc Guide by removing leftover merge conflict markers.
Update LLM comparator to use Strands, add tests and example script
* Update: add 3 file(s), modify 3 file(s)
* Update: modify 1 file(s)
* Update: modify 1 file(s)
* Update: modify 162 file(s)
* Update: modify 24 file(s)
* Added a mock class for strands to mitigate import errors, and raise an import error for trying to use LLM
* Updated jinja template package version
* Mocked strands agents in test for llm comparator
* Addressed PR comments to add a conftest and docstring at the file level
These commits enable the project to evaluate model confidence scores using AUROC metrics, with proper error handling, comprehensive tests, and clear documentation.
Co-authored-by: Aditya Addepalli <adiadd@amazon.com>
* Added an aggregate_from_comparisons() function that takes a list of compare_with() results for aggregations
* Added test cases for the aggregate and update functions in bulk evaluator
* Added documentation on the new evaluation aggregation function
🔒 Security Scan Results
ASH Security Scan Report
Detailed Findings (2 actionable findings)
Finding 1: CKV2_GHA_1
Finding 2: javascript.browser.security.insecure-document-method.insecure-document-method
Report generated by Automated Security Helper (ASH) at 2026-02-12T23:49:04+00:00
* Updating the js file to fix the HTML-based warnings from the ASH scan
* Fixing dead code throwing warnings in tests
* Registering slow tests with pyproj inclusion
🔒 Security Scan Results
ASH Security Scan Report
Detailed Findings (1 actionable finding)
Finding 1: CKV2_GHA_1
Report generated by Automated Security Helper (ASH) at 2026-02-16T21:58:03+00:00
Sync version across pyproject.toml and __init__.py (was 0.1.4 and 0.1.0 respectively) in preparation for v0.1.5 release.
🔒 Security Scan Results
ASH Security Scan Report
Detailed Findings (1 actionable finding)
Finding 1: CKV2_GHA_1
Report generated by Automated Security Helper (ASH) at 2026-02-17T14:54:43+00:00
Issue #, if available:
Description of changes:
1. Confidence Scoring with AUROC
Added support for confidence-based evaluation metrics using Area Under the Receiver Operating Characteristic (AUROC) curve
Extended JSON schema to accept: {'value': x, 'confidence': x}
AUROC calculation is opt-in and triggered by a flag passed to the compare_with() function
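
For reviewers unfamiliar with the metric, here is a minimal standalone sketch of what AUROC over confidence scores measures. It is not the code in this PR: the `auroc` helper and the sample data are hypothetical, and only the `{'value': x, 'confidence': x}` field shape comes from the description above. AUROC is computed here as the probability that a randomly chosen correct prediction has a higher confidence than a randomly chosen incorrect one, with ties counted as half.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Pairwise-ranking AUROC: P(confidence of a correct > confidence of an incorrect)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        # With only one class present there are no pairs to rank.
        raise ValueError("AUROC is undefined without both correct and incorrect predictions")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical predictions in the {'value': x, 'confidence': x} shape.
predictions = [
    {"value": "Alice", "confidence": 0.9},
    {"value": "Bob", "confidence": 0.8},
    {"value": "Carol", "confidence": 0.7},
    {"value": "Dan", "confidence": 0.4},
]
ground_truth = ["Alice", "Robert", "Carol", "Daniel"]

is_correct = [p["value"] == t for p, t in zip(predictions, ground_truth)]
score = auroc([p["confidence"] for p in predictions], is_correct)
```

A score near 1.0 means the confidence values rank correct predictions above incorrect ones; 0.5 means the confidences carry no ranking signal.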
2. LLM Comparator
New LLM-based comparator implementation using strands
Includes comprehensive pytest coverage
Mocked external dependencies (strands-agents, botocore) for reliable testing
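
The mocking approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the test code in this PR: `llm_judge` is a hypothetical stand-in for the comparator, and the `from strands import Agent` import plus calling the agent with a prompt string are assumptions about how the strands-agents package is used. The test swaps a fake module into `sys.modules`, so it runs without the package installed and without any network call.

```python
import sys
from unittest.mock import MagicMock, patch


def llm_judge(expected: str, actual: str) -> bool:
    """Hypothetical comparator: asks an LLM agent whether two values match."""
    # Lazy import (assumed package layout) so importing this module never
    # fails when the optional LLM dependency is absent.
    from strands import Agent

    agent = Agent()
    reply = agent(f"Do '{expected}' and '{actual}' mean the same thing? Answer YES or NO.")
    return str(reply).strip().upper().startswith("YES")


def test_llm_judge_with_fake_strands() -> None:
    # The fake agent instance returns a canned reply when called with a prompt.
    fake_agent = MagicMock(return_value="YES, they are equivalent")
    # The fake module exposes an Agent class whose constructor returns the fake agent.
    fake_module = MagicMock(Agent=MagicMock(return_value=fake_agent))
    with patch.dict(sys.modules, {"strands": fake_module}):
        assert llm_judge("NYC", "New York City") is True
    fake_agent.assert_called_once()
```

`patch.dict` restores `sys.modules` on exit, so the fake module does not leak into other tests.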
3. Bulk Evaluator Aggregations
Enhanced bulk evaluation capabilities with aggregation support
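
As a rough illustration of the aggregation idea, the sketch below averages a list of per-document comparison results. The result shape (`overall_score`, `field_scores`) and the `aggregate_results` name are hypothetical and chosen only for this example; the actual aggregate_from_comparisons() signature and the structure returned by compare_with() may differ.

```python
from collections import defaultdict
from statistics import mean
from typing import Any


def aggregate_results(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Average overall and per-field scores across comparison results (assumed shape)."""
    if not results:
        raise ValueError("cannot aggregate an empty list of results")
    per_field: dict[str, list[float]] = defaultdict(list)
    for result in results:
        for field, field_score in result["field_scores"].items():
            per_field[field].append(field_score)
    return {
        "count": len(results),
        "mean_overall_score": mean(r["overall_score"] for r in results),
        "mean_field_scores": {f: mean(scores) for f, scores in per_field.items()},
    }


# Hypothetical per-document results, one per comparison.
comparison_results = [
    {"overall_score": 1.0, "field_scores": {"name": 1.0, "total": 1.0}},
    {"overall_score": 0.5, "field_scores": {"name": 1.0, "total": 0.0}},
]
summary = aggregate_results(comparison_results)
```

Aggregating from already-computed results keeps comparison and aggregation separate, so comparisons can run in any order or in parallel before being summarized.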
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.