feat(classification): add semantic pattern matching for GUID, email, base64, format strings, and user agents #131
Conversation
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Updated the `SemanticClassifier` to utilize a new `StringContext` struct for context-aware classification.
- Enhanced classification capabilities for GUIDs, emails, Base64 strings, format strings, and user agents.
- Improved regex patterns and validation logic to reduce false positives.
- Consolidated path detection logic and removed redundant regex patterns.
- Updated tests to cover new classification scenarios and ensure performance benchmarks.
- Refactored integration tests to validate classification against various binary formats (ELF, PE, Mach-O).
- Removed deprecated methods and cleaned up code for better readability and maintainability.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
> **Note:** CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 **Walkthrough — Architectural & API Refactoring**

This PR restructures the semantic classification system into a context-aware model and consolidates classification logic into `src/classification/mod.rs` (deleting the old `semantic.rs`). The classifier API moves away from a `FoundString` wrapper and toward a lightweight context-driven signature: `SemanticClassifier::classify(&self, text: &str, context: &StringContext)`, plus a backwards-compatible `classify_found_string(&self, &FoundString)`. `apply_semantic_enrichment` now accepts `&ContainerInfo` to construct and pass `StringContext` to the classifier.

- Key Public API Changes
- Binary Format & Section Handling
- Semantic Classification Accuracy & Validation
- Performance & Reliability
- Tooling, Docs & Build
- Removals & Breaking Changes
- Notes on Review Focus
**Walkthrough**

This PR rewrites and consolidates the string classification subsystem into a context-aware API. The classifier now accepts `(text: &str, context: &StringContext)`, regex initializers use explicit expect messages, many pattern modules and tests are reorganized, and build/tooling/benchmarking infra (goreleaser, benches, mise, justfile) is added or updated.

**Sequence Diagram(s):** omitted — changes are internal library refactors and tests; no multi-actor runtime sequence diagram applicable.

**Estimated code review effort:** 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 6 passed | ❌ 2 failed (2 warnings)
**Codecov Report:** ❌ Patch coverage is …
Pull request overview
This PR expands Stringy’s semantic classification capabilities and tooling by introducing context-aware classification (via StringContext), adding new tag detectors (GUID/email/base64/format strings/user agents), and improving benchmarking/testing plus release/toolchain configs.
Changes:
- Added `StringContext` and rewired semantic enrichment/classification to be context-aware across extraction and tests.
- Implemented additional semantic detectors (GUID/email/base64/format strings/user agents) and adjusted domain tagging to avoid email double-tagging.
- Added Criterion benchmarks and updated build/release/toolchain/docs to reflect the new classification pipeline.
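The new call shape described above can be sketched as follows. The `StringContext` fields, enum variants, and tagging logic here are assumptions based on the PR summary, not the crate's actual definitions:

```rust
// Hypothetical sketch of the context-aware classify API.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum SectionType { ReadOnlyData, Text }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum BinaryFormat { Elf, Pe, MachO }

#[allow(dead_code)]
struct StringContext {
    section_type: SectionType,
    section_name: Option<String>,
    binary_format: BinaryFormat,
}

struct SemanticClassifier;

impl SemanticClassifier {
    fn new() -> Self { SemanticClassifier }

    // Mirrors the PR's signature: classify(&self, text, context) -> Vec<Tag>;
    // the URL check is placeholder logic, not Stringy's real detector.
    fn classify(&self, text: &str, _context: &StringContext) -> Vec<String> {
        let mut tags = Vec::new();
        if text.contains("://") {
            tags.push("url".to_string());
        }
        tags
    }
}

fn main() {
    let classifier = SemanticClassifier::new();
    let ctx = StringContext {
        section_type: SectionType::ReadOnlyData,
        section_name: Some(".rodata".to_string()),
        binary_format: BinaryFormat::Elf,
    };
    println!("{:?}", classifier.classify("https://api.example.com/v1", &ctx));
}
```

The point of the shape is that section/format metadata travels with every call instead of being bundled into a `FoundString` wrapper.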
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `src/types.rs` | Adds `StringContext` to carry classification context metadata. |
| `src/lib.rs` | Re-exports `StringContext` as part of the public API. |
| `src/extraction/mod.rs` | Passes container-derived context into semantic enrichment/classification. |
| `src/classification/mod.rs` | Reworks `SemanticClassifier` into a context-aware classifier and adds new detectors. |
| `src/classification/patterns/network.rs` | Prevents domain tagging when input looks like an email address. |
| `src/classification/patterns/mod.rs` | Removes regex re-exports no longer needed for cache testing. |
| `src/classification/patterns/paths.rs` | Removes `REGISTRY_PATH_REGEX` and relies on validation logic for registry detection. |
| `src/classification/semantic.rs` | Updates imports, but now appears out-of-sync with the active classifier/module tree. |
| `tests/classification_tests.rs` | Adds unit tests for new detectors and context-aware behavior. |
| `tests/classification_integration_tests.rs` | Adds integration coverage across ELF/PE/Mach-O fixtures plus perf/cache checks. |
| `tests/classification_integration.rs` | Updates snapshot/perf tests to match the new classifier API and tags. |
| `tests/snapshots/classification_integration__classification_snapshots.snap` | Updates expected snapshot outputs for new tags. |
| `benches/classification.rs` | Adds Criterion benchmark suite for classification workloads. |
| `Cargo.toml` | Adds benchmark target and new dependency (`lazy_static`); bumps some versions. |
| `docs/src/classification.md` | Rewrites classification documentation for the new pipeline and tags. |
| `justfile` | Updates lint/docs tooling invocation and simplifies GoReleaser tasks. |
| `mise.toml` | Expands toolchain tools/versions and adds more build/documentation tooling. |
| `.goreleaser.yaml` | Adds GoReleaser config for cross-platform Rust builds and hooks for tool setup. |
```rust
#[must_use]
pub fn classify(&self, text: &str, context: &StringContext) -> Vec<Tag> {
    let mut tags = Vec::new();
```
**Copilot AI** · Jan 25, 2026
`SemanticClassifier::classify` has been changed to require `(&str, &StringContext)`, which is a breaking change for any downstream code that previously called `classify(&FoundString)`. If this is intended to be backward-compatible, consider restoring the old `classify(&FoundString)` signature as a thin wrapper (and optionally deprecating it) while introducing a new name for the context-aware API (e.g., `classify_with_context`).
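One way the suggested wrapper could look. The `FoundString` fields, `StringContext` shape, and tagging logic below are hypothetical stand-ins for the crate's real types:

```rust
// Sketch of a backward-compatible wrapper around the new API.
struct FoundString {
    text: String,
    section_name: Option<String>, // assumed field, for illustration
}

#[allow(dead_code)]
struct StringContext {
    section_name: Option<String>,
}

struct SemanticClassifier;

impl SemanticClassifier {
    // New context-aware entry point (placeholder email check).
    fn classify_with_context(&self, text: &str, _context: &StringContext) -> Vec<String> {
        if text.contains('@') { vec!["email".to_string()] } else { Vec::new() }
    }

    // Old signature kept as a thin wrapper so downstream callers keep compiling.
    #[deprecated(note = "use classify_with_context")]
    fn classify(&self, s: &FoundString) -> Vec<String> {
        let ctx = StringContext { section_name: s.section_name.clone() };
        self.classify_with_context(&s.text, &ctx)
    }
}

#[allow(deprecated)]
fn main() {
    let found = FoundString { text: "admin@example.com".to_string(), section_name: None };
    println!("{:?}", SemanticClassifier.classify(&found));
}
```

The `#[deprecated]` attribute lets existing callers compile with a warning while steering new code toward the context-aware name.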
`src/classification/mod.rs` (Outdated)
```rust
use lazy_static::lazy_static;
use regex::Regex;

use crate::types::{BinaryFormat, SectionType, StringContext, StringSource, Tag};

pub mod patterns;
pub mod ranking;
pub mod semantic;
pub mod symbols;

pub use ranking::{RankingConfig, RankingEngine};
pub use semantic::SemanticClassifier;
pub use symbols::SymbolDemangler;

// Import pattern classification functions
use patterns::{
    classify_domain, classify_ip_addresses, classify_posix_path, classify_registry_path,
    classify_unc_path, classify_url, classify_windows_path,
};

lazy_static! {
    static ref GUID_REGEX: Regex = Regex::new(
        r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$",
    )
    .expect("Invalid GUID regex");
```
**Copilot AI** · Jan 25, 2026
This module introduces lazy_static for regex caching, but the rest of the codebase uses once_cell::sync::Lazy for the same purpose (e.g., src/classification/patterns/network.rs:6-22, src/classification/patterns/paths.rs:6-24). Consider switching these regex statics to once_cell::sync::Lazy and dropping the extra lazy_static dependency to keep conventions and dependencies consistent.
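For a dependency-free sketch of the same lazy-initialization pattern, the standard library's `LazyLock` (the std counterpart of `once_cell::sync::Lazy`, Rust ≥ 1.80) works identically; the GUID check below is hand-rolled so the example avoids the `regex` crate, whereas the real module would hold a compiled `Regex` in the static:

```rust
use std::sync::LazyLock;

// In the real module this static would be `LazyLock<Regex>` built from the
// GUID pattern; a plain lookup table keeps the sketch crate-free while
// showing the same one-time lazy initialization.
static GUID_GROUP_LENS: LazyLock<[usize; 5]> = LazyLock::new(|| [8, 4, 4, 4, 12]);

fn is_braced_guid(s: &str) -> bool {
    // {8-4-4-4-12} hex groups, braces required (matches the regex above).
    let inner = match s.strip_prefix('{').and_then(|t| t.strip_suffix('}')) {
        Some(i) => i,
        None => return false,
    };
    let groups: Vec<&str> = inner.split('-').collect();
    groups.len() == GUID_GROUP_LENS.len()
        && groups
            .iter()
            .zip(GUID_GROUP_LENS.iter())
            .all(|(g, &n)| g.len() == n && g.chars().all(|c| c.is_ascii_hexdigit()))
}

fn main() {
    println!("{}", is_braced_guid("{12345678-1234-1234-1234-123456789abc}"));
}
```

Swapping `LazyLock` for `once_cell::sync::Lazy` is a one-line change, which is why consolidating on one of them (and dropping `lazy_static`) is low-cost.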
**docs/src/classification.md** (diff excerpt)

```diff
 ## Semantic Categories

-### Network Indicators
+### GUIDs

-#### URLs
+- Pattern: `\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}`
+- Examples: `{12345678-1234-1234-1234-123456789abc}`
+- Validation: Strict format compliance with braces required

-- **Pattern**: `` https?://[^\s<>"{}|\\^\[\]\`]+ ``
-- **Examples**: `https://api.example.com/v1/users`, `http://malware.com/payload`
-- **Validation**: URL format check with safe character filtering
-- **Security relevance**: High - indicates network communication
+### Email Addresses

-#### Domain Names
+- Pattern: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
+- Examples: `admin@malware.com`, `user.name+tag@example.co.uk`
+- Validation: Single `@`, valid TLD length and characters, no empty parts

-- **Pattern**: `\b(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}\b`
-- **Examples**: `api.example.com`, `malware-c2.net`
-- **Validation**: TLD checking, DNS format compliance
-- **Security relevance**: High - C2 domains, legitimate services
+### Base64 Data

-#### IP Addresses
+- Pattern: `[A-Za-z0-9+/]{20,}={0,2}`
+- Examples: `U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw==`
+- Validation: Length >= 20, length divisible by 4, padding rules, entropy threshold

-- **IPv4 Pattern**: `\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b`
-- **IPv6 Pattern**: Comprehensive pattern supporting full notation, compressed notation (`::1`), and mixed notation (`::ffff:192.0.2.1`)
-- **Examples**: `192.168.1.1`, `2001:db8::1`, `[::1]:8080`
-- **Validation**: Two-stage validation using regex pre-filter followed by `std::net::IpAddr` parsing for correctness
-- **Port Handling**: IP addresses with ports (e.g., `192.168.1.1:8080`) are supported by automatically stripping the port suffix before validation
-- **IPv6 Bracket Handling**: Bracketed IPv6 addresses (e.g., `[::1]` and `[::1]:8080`) are supported
-- **False Positive Mitigation**: Version numbers like `1.2.3.4` are accepted as IPv4 addresses by design
-- **Implementation**: See `src/classification/semantic.rs` for the complete implementation
-- **Security relevance**: High - infrastructure indicators
+### Format Strings

-### File System Indicators
+- Pattern: `%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}`
+- Examples: `Error: %s at line %d`, `User {0} logged in`
+- Validation: Reasonable specifier count, context-aware thresholds

-#### File Paths
+### User Agents

-- **POSIX Pattern**: `^/[^\0\n\r]*`
-- **Windows Pattern**: `^[A-Za-z]:\\[^\0\n\r]*`
-- **UNC Pattern**: `^\\\\[a-zA-Z0-9.-]+\\[^\0\n\r]*`
-- **Examples**: `/usr/bin/malware`, `C:\\Windows\\System32\\evil.dll`, `\\\\server\\share\\file.txt`
-- **Validation rules**: Rejects null bytes, newlines, carriage returns; rejects consecutive path separators in POSIX paths (`//`) and consecutive backslashes in Windows paths (for example, `folder\\\\file.txt`), while allowing UNC paths that start with `\\\\`; applies a reasonable length limit (4096 max, stricter for unknown prefixes); POSIX paths must be absolute (start with `/`); Windows paths must use backslashes and a valid drive letter
-- **Suspicious path examples**: `/etc/cron.d/`, `/etc/init.d/`, `/usr/local/bin/`, `/tmp/`, `/var/tmp/`; `C:\\Windows\\System32\\`, `C:\\Windows\\Temp\\`, `...\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Startup\\`
-- **Security relevance**: Medium-High - persistence and execution locations
+- Pattern: `Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+`
+- Examples: `Mozilla/5.0 (Windows NT 10.0; Win64; x64)`, `Chrome/117.0.5938.92`
+- Validation: Known browser identifiers and minimum length

-#### Registry Paths
-
-- **Full root pattern**: `^HKEY_[A-Z_]+\\[^\0\n\r]*`
-- **Abbreviated root pattern**: `^HK(LM|CU|CR|U|CC)\\[^\0\n\r]*`
-- **Supported root keys**:
-  - `HKEY_LOCAL_MACHINE`
-  - `HKEY_CURRENT_USER`
-  - `HKEY_CLASSES_ROOT`
-  - `HKEY_USERS`
-  - `HKEY_CURRENT_CONFIG`
-- **Supported abbreviations**:
-  - `HKLM`, `HKCU`, `HKCR`, `HKU`, `HKCC`
-- **Suspicious registry paths**:
-  - `\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run`
-  - `\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\RunOnce`
-  - `\\System\\CurrentControlSet\\Services`
-  - `\\SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\Winlogon`
-- **Examples**:
-  - `HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run`
-  - `HKCU\\Software\\Microsoft`
-- **Security relevance**: High - persistence mechanisms
-
-### Identifiers
-
-#### GUIDs/UUIDs
-
-- **Pattern**: `\{?[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}?`
-- **Examples**: `{12345678-1234-1234-1234-123456789abc}`, `12345678-1234-1234-1234-123456789abc`
-- **Validation**: Format compliance
-- **Security relevance**: Medium - component identification
-
-#### Email Addresses
-
-- **Pattern**: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
-- **Examples**: `admin@malware.com`, `support@legitimate.org`
-- **Validation**: Basic format validation
-- **Security relevance**: Medium - contact information
-
-### Code Artifacts
-
-#### Format Strings
-
-- **Pattern**: `%[-+0 #]*(\d+|\*)?(\.(\d+|\*))?(hh?|ll?|[Lzjt])?[diouxXeEfFgGaAcspn%]`
-- **Examples**: `Error: %s at line %d`, `Name: %s, Age: %d, Score: %.2f`
-- **Context**: Presence of real format specifiers (%% alone is ignored)
-- **Security relevance**: Low-Medium - debugging information
-
-#### Base64 Data
-
-- **Pattern**: Character set validation with padding rules
-- **Examples**: `SGVsbG8gV29ybGQ=`
-- **Validation**: Length >= 16, Base64 character set, valid padding, reject length mod 4 of 1
-- **Security relevance**: Variable - encoded payloads
-
-#### User Agents
-
-- **Pattern**: Prefix match for common agents (Mozilla, curl, Wget, python-requests, libwww-perl, Java, Apache-HttpClient, okhttp, PostmanRuntime)
-- **Examples**: `Mozilla/5.0 (Windows NT 10.0; Win64; x64)`, `curl/7.68.0`
-- **Security relevance**: Medium - network fingerprinting
-
+## Tag Specificity
+
+Tags are treated as either specific or broad. Specific tags indicate high confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag and should be treated as ambiguous due to higher false positive risk.
```
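The Base64 validation rules documented above (length ≥ 20, length divisible by 4, charset and padding checks, entropy threshold) can be sketched as follows. The 3.0 bits-per-byte entropy cutoff is an assumption; the doc only says a threshold is applied:

```rust
// Shannon entropy in bits per byte over the input bytes.
fn shannon_entropy(bytes: &[u8]) -> f64 {
    let mut counts = [0usize; 256];
    for &b in bytes {
        counts[b as usize] += 1;
    }
    let n = bytes.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

// Heuristic per the documented rules; a sketch, not Stringy's implementation.
fn looks_like_base64(s: &str) -> bool {
    let bytes = s.as_bytes();
    if bytes.len() < 20 || bytes.len() % 4 != 0 {
        return false;
    }
    let pad = bytes.iter().rev().take_while(|&&b| b == b'=').count();
    if pad > 2 {
        return false;
    }
    let body = &bytes[..bytes.len() - pad];
    if !body
        .iter()
        .all(|&b| b.is_ascii_alphanumeric() || b == b'+' || b == b'/')
    {
        return false;
    }
    shannon_entropy(body) >= 3.0 // assumed threshold value
}

fn main() {
    println!("{}", looks_like_base64("U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw=="));
}
```

The entropy gate is what keeps low-variety strings like `aaaaaaaaaaaaaaaaaaaa` (valid charset, valid length) from being tagged.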
**Copilot AI** · Jan 25, 2026
The rewritten documentation lists only GUID/email/base64/format-string/user-agent categories, but the classifier still supports URL/domain/IP/file-path/registry-path tagging (see src/classification/mod.rs capability list and implementation). Please update this doc to include the still-supported tag categories so it matches the current behavior.
```yaml
before:
  hooks:
    # if you don't do these things before calling goreleaser, it might be a
    # good idea to do them here:
    - rustup default stable
    - cargo install --locked cargo-zigbuild
    - cargo fetch --locked
```
**Copilot AI** · Jan 25, 2026
The before.hooks steps have reproducibility/side-effect concerns: rustup default stable mutates the global toolchain (and the repo already pins a toolchain via rust-toolchain.toml), and cargo install --locked cargo-zigbuild does not pin a cargo-zigbuild version. Consider removing the rustup default step and pinning cargo-zigbuild with --version (or relying on a preinstalled/pinned tool) to make release builds deterministic.
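A deterministic variant of the hooks could look like the sketch below; the `rustup default` step is dropped because `rust-toolchain.toml` already pins the toolchain, and the `cargo-zigbuild` version shown is a placeholder to be replaced with a verified release:

```yaml
before:
  hooks:
    # Toolchain comes from rust-toolchain.toml; no global `rustup default`.
    - cargo install --locked cargo-zigbuild --version <pinned-version>  # placeholder version
    - cargo fetch --locked
```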
```toml
git-cliff = "2.11.0"
goreleaser = "2.13.3"
just = "latest"
markdownlint-cli2 = "0.20.0"
```
**Copilot AI** · Jan 25, 2026
just = "latest" reduces toolchain reproducibility compared to the previous pinned version, which can make CI/local behavior drift over time. If reproducible builds are a goal, consider pinning just to a specific version here (similar to other tools).
```diff
 [dependencies]
 clap = { version = "4.5.54", features = [ "derive" ] }
 cpp_demangle = "0.5.1"
-entropy = "0.4.2"
+entropy = "0.4.3"
 goblin = "0.10.4"
+lazy_static = "1.5.0"
 once_cell = "1.21.3"
 pelite = "0.10.0"
 regex = "1.12.2"
```
**Copilot AI** · Jan 25, 2026
lazy_static was added as a new dependency, but existing regex statics in this repo use once_cell::sync::Lazy. If you switch the new classifier regex statics to once_cell, you can likely remove lazy_static here to avoid an extra dependency and keep the project consistent.
Actionable comments posted: 10
> **Caution:** Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
**src/classification/patterns/network.rs (1)**

**114-141:** Add test coverage for email exclusion from domain classification. The existing tests cover URL exclusion but not the new email exclusion behavior.
🧪 Suggested test addition
```diff
 #[test]
 fn test_domain_valid_and_invalid() {
     assert!(classify_domain("example.com").is_some());
     assert!(classify_domain("https://example.com").is_none());
     assert!(classify_domain("invalid.xyz123").is_none());
 }
+
+#[test]
+fn test_email_not_tagged_as_domain() {
+    // Email addresses should not be tagged as domains
+    assert!(classify_domain("user@example.com").is_none());
+    assert!(classify_domain("admin@subdomain.example.org").is_none());
+}
```

**src/lib.rs (1)**
**69-73:** Add crate-level safety attributes to the library root.

The `#![forbid(unsafe_code)]` and `#![deny(warnings)]` attributes are missing from `src/lib.rs`. These must be present at the crate root before any module declarations. Add them immediately after the crate documentation block and before the `pub mod` statements:

```rust
#![forbid(unsafe_code)]
#![deny(warnings)]

pub mod classification;
```

**tests/classification_integration.rs (1)**
**50-72:** Avoid timing-based assertion in unit tests.

Line 72 enforces a 500ms budget, which is CI- and load-sensitive and can cause flaky failures. Gate this behind an env flag or move it to benchmarks. As per coding guidelines, avoid timing-dependent assertions in tests.

Proposed gating to avoid flaky CI runs:

```diff
 #[test]
 fn test_classification_performance() {
+    if std::env::var("STRINGY_PERF_TESTS").is_err() {
+        return;
+    }
     let classifier = SemanticClassifier::new();
     ...
     assert!(elapsed < Duration::from_millis(500));
 }
```
🤖 Fix all issues with AI agents
In @.goreleaser.yaml:
- Around line 30-38: The rust builder block (builder: rust with id: cross) is
missing the required command for cargo-zigbuild; update that block to add
"command: zigbuild" so GoReleaser uses cargo-zigbuild instead of the default
cargo build when cross-compiling (ensuring targets like x86_64-unknown-linux-gnu
and aarch64-unknown-linux-gnu are built correctly).
In `@Cargo.toml`:
- Around line 27-28: Inconsistency: some pattern modules use
once_cell::sync::Lazy while src/classification/mod.rs uses lazy_static!; convert
those pattern modules to use lazy_static! for uniform lazy initialization. For
each pattern module that currently declares statics like "static NAME:
Lazy<Type> = Lazy::new(|| ...)", replace the import once_cell::sync::Lazy with
"use lazy_static::lazy_static;" and change the declaration to "lazy_static! {
static ref NAME: Type = ...; }", preserving the initializer body and ensuring
types match (e.g., HashMap, Regex); remove once_cell imports and usages, keep
the existing lazy_static dependency in Cargo.toml, and mirror the same pattern
used in src/classification/mod.rs so all modules use lazy_static! consistently.
In `@docs/src/classification.md`:
- Around line 96-97: Update the documentation reference in
docs/src/classification.md: replace the incorrect integration test path
`tests/classification_integration_tests.rs` with the actual test file name
`tests/classification_integration.rs` so the docs point to the correct
integration test.
In `@mise.toml`:
- Line 10: The file defines mdbook twice; remove the duplicate backend entry so
only one key remains (either "cargo:mdbook" or "mdbook"). Locate the two keys
"cargo:mdbook" and "mdbook" in the mise.toml and delete the one you do not want
to use (prefer keeping "cargo:mdbook" if you need the cargo backend), ensuring
the remaining key retains its version string.
- Around line 18-22: The manifest uses "latest" for the claude and just entries
which breaks reproducible builds; update the claude and just keys to fixed
semantic versions (e.g., replace claude = "latest" and just = "latest" with
explicit version strings) so builds are deterministic, ensure the chosen
versions are compatible with the other listed tools (cyclonedx, git-cliff,
goreleaser), and run any dependency lock/update step your repo uses after
changing the versions to commit the resolved state.
- Line 9: The manifest pins a pre-release "cargo:cargo-nextest" = "0.9.123-b.4";
update it to the stable release by replacing the value with "0.9.123" (i.e.,
change the "cargo:cargo-nextest" entry to use the stable 0.9.123 version) to
avoid using an undocumented pre-release.
In `@src/classification/mod.rs`:
- Around line 75-86: Mark both public structs with #[non_exhaustive] to follow
the guideline: add #[non_exhaustive] above SemanticClassifier and
RegexCacheAddresses. For RegexCacheAddresses, make the fields private (remove
pub on guid, email, base64, format, user_agent) and add a public constructor pub
fn new(guid: usize, email: usize, base64: usize, format: usize, user_agent:
usize) -> Self that returns Self { guid, email, base64, format, user_agent },
keeping the existing derives; update any call sites to use
RegexCacheAddresses::new instead of struct literal field initialization.
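A sketch of the refactor this prompt describes: fields made private, a constructor added, and `#[non_exhaustive]` applied. The `guid()` accessor is illustrative, not part of the crate:

```rust
// #[non_exhaustive] prevents out-of-crate struct literals, so new fields
// can be added later without a breaking change; new() is the sanctioned
// construction path.
#[doc(hidden)]
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RegexCacheAddresses {
    guid: usize,
    email: usize,
    base64: usize,
    format: usize,
    user_agent: usize,
}

impl RegexCacheAddresses {
    pub fn new(guid: usize, email: usize, base64: usize, format: usize, user_agent: usize) -> Self {
        Self { guid, email, base64, format, user_agent }
    }

    // Illustrative accessor since the fields are no longer public.
    pub fn guid(&self) -> usize {
        self.guid
    }
}

fn main() {
    let addrs = RegexCacheAddresses::new(1, 2, 3, 4, 5);
    println!("{}", addrs.guid());
}
```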
In `@src/types.rs`:
- Around line 295-308: Add stability guarantees to the public API type
StringContext by marking the struct with #[non_exhaustive] and adding an
explicit constructor similar to ContainerInfo; specifically, annotate the
StringContext definition with #[non_exhaustive] and implement impl StringContext
{ pub fn new(section_type: SectionType, section_name: Option<String>,
binary_format: BinaryFormat, encoding: Encoding, source: StringSource) -> Self {
... } } so callers must use the constructor and future fields can be added
without breaking changes.
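A sketch of the constructor-based `StringContext` this prompt describes. The enum definitions are illustrative stand-ins for the crate's real types:

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
pub enum SectionType { ReadOnlyData, Text, Data }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
pub enum BinaryFormat { Elf, Pe, MachO }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
pub enum Encoding { Ascii, Utf16Le }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
pub enum StringSource { Section, Import }

// #[non_exhaustive] plus an explicit constructor means future fields can
// be added without breaking downstream crates.
#[non_exhaustive]
#[derive(Debug, Clone)]
pub struct StringContext {
    pub section_type: SectionType,
    pub section_name: Option<String>,
    pub binary_format: BinaryFormat,
    pub encoding: Encoding,
    pub source: StringSource,
}

impl StringContext {
    pub fn new(
        section_type: SectionType,
        section_name: Option<String>,
        binary_format: BinaryFormat,
        encoding: Encoding,
        source: StringSource,
    ) -> Self {
        Self { section_type, section_name, binary_format, encoding, source }
    }
}

fn main() {
    let ctx = StringContext::new(
        SectionType::ReadOnlyData,
        Some(".rodata".to_string()),
        BinaryFormat::Elf,
        Encoding::Ascii,
        StringSource::Section,
    );
    println!("{:?}", ctx.section_name);
}
```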
In `@tests/classification_integration_tests.rs`:
- Around line 15-27: The test helper create_test_context constructs
StringContext via a struct literal which will break if StringContext becomes
#[non_exhaustive]; change create_test_context to call the provided constructor
(e.g. StringContext::new) instead of using the literal, passing section_type,
Some(".rodata".to_string()) (or the constructor's section_name param),
binary_format, Encoding::Ascii, and source so the helper remains compatible with
future additions to StringContext.
- Around line 153-179: The performance test test_classification_performance
currently asserts elapsed < Duration::from_secs(2) for ~3600 samples which is
inconsistent with the other integration test's 500ms budget; update this test to
use the same timeout strategy (e.g., change the assertion to use
Duration::from_millis(500) or adjust the samples count to match that budget) or
add a brief inline comment near test_classification_performance explaining why
this test uses a longer 2s timeout for CI stability; ensure you touch the
Duration::from_secs(2) assertion and/or the samples construction and keep
references to SemanticClassifier::new(), classifier.classify(...), and
regex_cache_addresses() intact.
`Cargo.toml` (Outdated)

```toml
lazy_static = "1.5.0"
once_cell = "1.21.3"
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check which lazy initialization pattern is used where
echo "=== lazy_static! usage ==="
rg -n 'lazy_static!' --type rust
echo ""
echo "=== once_cell::sync::Lazy usage ==="
rg -n 'once_cell::sync::Lazy' --type rust
echo ""
echo "=== once_cell re-exports (lazy_static or Lazy) in module level ==="
rg -n 'once_cell' --type rust | head -20
echo ""
echo "=== Cargo.toml dependencies around lines 25-35 ==="
sed -n '25,35p' Cargo.toml
```

Repository: EvilBit-Labs/Stringy

Length of output: 1115
Length of output: 1115
Inconsistent lazy initialization patterns.
The codebase mixes two lazy initialization approaches: lazy_static! is used in src/classification/mod.rs, while once_cell::sync::Lazy is used across four pattern modules. Standardize on one approach for consistency.
Given that lazy_static is listed as a core dependency in the coding guidelines and is actively used, keep it. Migrate pattern modules from once_cell::sync::Lazy to lazy_static! to maintain uniform lazy initialization patterns.
🤖 Prompt for AI Agents
In `@Cargo.toml` around lines 27 - 28, Inconsistency: some pattern modules use
once_cell::sync::Lazy while src/classification/mod.rs uses lazy_static!; convert
those pattern modules to use lazy_static! for uniform lazy initialization. For
each pattern module that currently declares statics like "static NAME:
Lazy<Type> = Lazy::new(|| ...)", replace the import once_cell::sync::Lazy with
"use lazy_static::lazy_static;" and change the declaration to "lazy_static! {
static ref NAME: Type = ...; }", preserving the initializer body and ensuring
types match (e.g., HashMap, Regex); remove once_cell imports and usages, keep
the existing lazy_static dependency in Cargo.toml, and mirror the same pattern
used in src/classification/mod.rs so all modules use lazy_static! consistently.
```markdown
- Unit tests: `tests/classification_tests.rs`
- Integration tests: `tests/classification_integration_tests.rs`
```
Fix integration test path in docs.
Line 97 references tests/classification_integration_tests.rs, but the test file in this PR is tests/classification_integration.rs.
Proposed doc fix:

```diff
-- Integration tests: `tests/classification_integration_tests.rs`
+- Integration tests: `tests/classification_integration.rs`
```

📝 Committable suggestion

‼️ **IMPORTANT**: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```markdown
- Unit tests: `tests/classification_tests.rs`
- Integration tests: `tests/classification_integration.rs`
```
🤖 Prompt for AI Agents
In `@docs/src/classification.md` around lines 96 - 97, Update the documentation
reference in docs/src/classification.md: replace the incorrect integration test
path `tests/classification_integration_tests.rs` with the actual test file name
`tests/classification_integration.rs` so the docs point to the correct
integration test.
```toml
"cargo:cargo-deny" = "0.19.0"
"cargo:cargo-dist" = "0.30.3"
"cargo:cargo-llvm-cov" = "0.6.24"
"cargo:cargo-nextest" = "0.9.123-b.4"
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's examine the mise.toml file
cat -n mise.toml | head -20
```

Repository: EvilBit-Labs/Stringy
Length of output: 952
🏁 Script executed:
```shell
# Search for cargo-nextest version information in the codebase
rg "cargo-nextest" -A 2 -B 2
```

Repository: EvilBit-Labs/Stringy
Length of output: 1101
🌐 Web query:
cargo-nextest 0.9.123 pre-release version
💡 Result:
cargo-nextest 0.9.123 had beta (pre-release) builds:
- 0.9.123-b.1 (2026-01-18): introduces experimental record & replay test runs. [1]
- 0.9.123-b.2 (2026-01-18): improvements to record-replay (storage + display). [1]
- 0.9.123-b.3 (2026-01-21): adds rerun functionality on top of record-replay. [1]
If you want the pre-release binaries, use the “release URL” scheme and put the exact pre-release in {version} (for example 0.9.123-b.3):
https://get.nexte.st/{version}/{platform} [2]
Example (Linux x86_64):
```shell
curl -LsSf https://get.nexte.st/0.9.123-b.3/linux | tar zxf - -C ${CARGO_HOME:-~/.cargo}/bin
```

Note: the stable cargo-nextest 0.9.123 release was published on 2026-01-23. [3]
Sources: [1] [2] [3]
Use stable cargo-nextest 0.9.123 instead of pre-release
Line 9 pins a pre-release version (0.9.123-b.4). The stable 0.9.123 release is available (2026-01-23). For a binary analysis tool, prefer the stable release to avoid unnecessary instability. Additionally, only beta versions b.1–b.3 are documented; verify that 0.9.123-b.4 actually exists.
🤖 Prompt for AI Agents
In `@mise.toml` at line 9, The manifest pins a pre-release "cargo:cargo-nextest" =
"0.9.123-b.4"; update it to the stable release by replacing the value with
"0.9.123" (i.e., change the "cargo:cargo-nextest" entry to use the stable
0.9.123 version) to avoid using an undocumented pre-release.
```toml
"cargo:cargo-dist" = "0.30.3"
"cargo:cargo-llvm-cov" = "0.6.24"
"cargo:cargo-nextest" = "0.9.123-b.4"
"cargo:mdbook" = "0.5.2"
```
🧹 Nitpick | 🔵 Trivial
Avoid duplicate mdbook definitions
Line 10 and Line 24 define mdbook twice via different backends. This can lead to ambiguous tool resolution or double installs. Prefer a single source of truth (either cargo:mdbook or mdbook) and remove the other.
Example (keep cargo backend)
```diff
-mdbook = "0.5.2"
```

Also applies to: 24-24
🤖 Prompt for AI Agents
In `@mise.toml` at line 10, The file defines mdbook twice; remove the duplicate
backend entry so only one key remains (either "cargo:mdbook" or "mdbook").
Locate the two keys "cargo:mdbook" and "mdbook" in the mise.toml and delete the
one you do not want to use (prefer keeping "cargo:mdbook" if you need the cargo
backend), ensuring the remaining key retains its version string.
`mise.toml` (Outdated)

```toml
claude = "latest"
cyclonedx = "0.29.2"
git-cliff = "2.11.0"
goreleaser = "2.13.3"
just = "latest"
```
🧩 Analysis chain
🏁 Script executed:
# Find and read the mise.toml file
find . -name "mise.toml" -type f

Repository: EvilBit-Labs/Stringy
Length of output: 75
🏁 Script executed:
# Once located, read the file with line numbers
cat -n mise.toml | head -30

Repository: EvilBit-Labs/Stringy
Length of output: 1410
🏁 Script executed:
# Check for any Unicode punctuation in mise.toml
file mise.toml
hexdump -C mise.toml | grep -E "e2|e3|e4|e8|e9|ea|eb|ef" | head -20

Repository: EvilBit-Labs/Stringy
Length of output: 152
Pin explicit versions for claude and just to maintain deterministic builds
Lines 18 and 22 use "latest" for claude and just, which breaks build reproducibility across environments and time. Pin specific versions.
Example fix
-claude = "latest"
+claude = "X.Y.Z"
...
-just = "latest"
+just = "X.Y.Z"

🤖 Prompt for AI Agents
In `@mise.toml` around lines 18 - 22, The manifest uses "latest" for the claude
and just entries which breaks reproducible builds; update the claude and just
keys to fixed semantic versions (e.g., replace claude = "latest" and just =
"latest" with explicit version strings) so builds are deterministic, ensure the
chosen versions are compatible with the other listed tools (cyclonedx,
git-cliff, goreleaser), and run any dependency lock/update step your repo uses
after changing the versions to commit the resolved state.
| #[derive(Debug, Default)] | ||
| pub struct SemanticClassifier; | ||
|
|
||
| #[doc(hidden)] | ||
| #[derive(Debug, Clone, Copy, PartialEq, Eq)] | ||
| pub struct RegexCacheAddresses { | ||
| pub guid: usize, | ||
| pub email: usize, | ||
| pub base64: usize, | ||
| pub format: usize, | ||
| pub user_agent: usize, | ||
| } |
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
fd -t f "mod.rs" "src/classification"

Repository: EvilBit-Labs/Stringy
Length of output: 124
🏁 Script executed:
cat -n src/classification/mod.rs

Repository: EvilBit-Labs/Stringy
Length of output: 14179
Add #[non_exhaustive] to public structs.
Both SemanticClassifier and RegexCacheAddresses are public and must be marked #[non_exhaustive] per coding guidelines. Additionally, RegexCacheAddresses should provide an explicit constructor instead of exposing fields directly.
Proposed change
+#[non_exhaustive]
#[derive(Debug, Default)]
pub struct SemanticClassifier;
#[doc(hidden)]
+#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RegexCacheAddresses {

🤖 Prompt for AI Agents
In `@src/classification/mod.rs` around lines 75 - 86, Mark both public structs
with #[non_exhaustive] to follow the guideline: add #[non_exhaustive] above
SemanticClassifier and RegexCacheAddresses. For RegexCacheAddresses, make the
fields private (remove pub on guid, email, base64, format, user_agent) and add a
public constructor pub fn new(guid: usize, email: usize, base64: usize, format:
usize, user_agent: usize) -> Self that returns Self { guid, email, base64,
format, user_agent }, keeping the existing derives; update any call sites to use
RegexCacheAddresses::new instead of struct literal field initialization.
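The pattern this review asks for can be sketched standalone. Below is a hypothetical two-field mirror of `RegexCacheAddresses` (the real struct has five fields); within its defining crate `#[non_exhaustive]` has no effect, but for downstream crates it forbids struct-literal construction, so the constructor becomes the only entry point and later field additions are non-breaking:

```rust
// Hypothetical two-field mirror of RegexCacheAddresses (the real struct
// has five fields). Fields are private; construction goes through `new`.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RegexCacheAddresses {
    guid: usize,
    email: usize,
}

impl RegexCacheAddresses {
    #[must_use]
    pub fn new(guid: usize, email: usize) -> Self {
        Self { guid, email }
    }

    pub fn guid(&self) -> usize {
        self.guid
    }

    pub fn email(&self) -> usize {
        self.email
    }
}

fn main() {
    // Downstream crates cannot write `RegexCacheAddresses { .. }` once the
    // struct is #[non_exhaustive]; `new` is the only construction path.
    let a = RegexCacheAddresses::new(0x1000, 0x2000);
    assert_eq!(a.guid(), 0x1000);
    assert_eq!(a.email(), 0x2000);
}
```

Accessor methods replace the removed public fields so callers can still read the cached addresses without the fields being part of the API surface.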
…e deprecated semantic module Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@benches/classification.rs`:
- Around line 16-22: The benchmark named bench_regex_compilation currently only
calls SemanticClassifier::new, which doesn't force lazy_static regex
compilation; either rename the benchmark to reflect it's measuring the
constructor (e.g., classification_constructor) or change its body to trigger
regex initialization explicitly (call a method that uses the regex such as
SemanticClassifier::classify or call the module's regex initializer so
lazy_static/once_cell runs) so the measured work matches the benchmark name;
apply the same change to the duplicate at lines 124-135 as well.
♻️ Duplicate comments (2)
src/classification/mod.rs (1)
75-86: Add #[non_exhaustive] and a constructor for public structs.

Both SemanticClassifier and RegexCacheAddresses are public and should be non_exhaustive. RegexCacheAddresses should also expose a constructor so fields are not the public API surface. This is required for API stability. As per coding guidelines.
Proposed fix
+#[non_exhaustive]
 #[derive(Debug, Default)]
 pub struct SemanticClassifier;

 #[doc(hidden)]
+#[non_exhaustive]
 #[derive(Debug, Clone, Copy, PartialEq, Eq)]
 pub struct RegexCacheAddresses {
-    pub guid: usize,
-    pub email: usize,
-    pub base64: usize,
-    pub format: usize,
-    pub user_agent: usize,
+    guid: usize,
+    email: usize,
+    base64: usize,
+    format: usize,
+    user_agent: usize,
 }
+
+impl RegexCacheAddresses {
+    #[must_use]
+    pub fn new(guid: usize, email: usize, base64: usize, format: usize, user_agent: usize) -> Self {
+        Self {
+            guid,
+            email,
+            base64,
+            format,
+            user_agent,
+        }
+    }
+}

 pub fn regex_cache_addresses(&self) -> RegexCacheAddresses {
-    RegexCacheAddresses {
-        guid: &*GUID_REGEX as *const Regex as usize,
-        email: &*EMAIL_REGEX as *const Regex as usize,
-        base64: &*BASE64_REGEX as *const Regex as usize,
-        format: &*FORMAT_REGEX as *const Regex as usize,
-        user_agent: &*USER_AGENT_REGEX as *const Regex as usize,
-    }
+    RegexCacheAddresses::new(
+        &*GUID_REGEX as *const Regex as usize,
+        &*EMAIL_REGEX as *const Regex as usize,
+        &*BASE64_REGEX as *const Regex as usize,
+        &*FORMAT_REGEX as *const Regex as usize,
+        &*USER_AGENT_REGEX as *const Regex as usize,
+    )
 }

Also applies to: 178-187
tests/classification_integration_tests.rs (1)
148-174: Avoid hard wall clock assertions in tests.

This can be flaky across CI runners. Prefer moving the performance check to Criterion or mark it ignored for CI. As per coding guidelines.
Proposed fix
-#[test]
-fn test_classification_performance() {
+#[test]
+#[ignore]
+fn test_classification_performance() {
benches/classification.rs
Outdated
| fn bench_regex_compilation(c: &mut Criterion) { | ||
| c.bench_function("classification_regex_compilation", |b| { | ||
| b.iter(|| { | ||
| let _ = SemanticClassifier::new(); | ||
| }); | ||
| }); | ||
| } |
🧹 Nitpick | 🔵 Trivial
Rename or rework the regex compilation benchmark.
SemanticClassifier::new does not trigger lazy_static regex compilation, so this benchmark measures a trivial constructor, not regex compilation. Rename it or trigger regex use explicitly so the measurement matches its name.
Proposed fix
-fn bench_regex_compilation(c: &mut Criterion) {
- c.bench_function("classification_regex_compilation", |b| {
+fn bench_classifier_init(c: &mut Criterion) {
+ c.bench_function("classification_classifier_init", |b| {
b.iter(|| {
let _ = SemanticClassifier::new();
});
});
}

criterion_group!(
classification_benches,
- bench_regex_compilation,
+ bench_classifier_init,
bench_guid_classification,
bench_email_classification,
bench_base64_classification,
bench_format_string_classification,
bench_user_agent_classification,
bench_batch_classification,
bench_worst_case,
bench_context_creation
);

Also applies to: 124-135
🤖 Prompt for AI Agents
In `@benches/classification.rs` around lines 16 - 22, The benchmark named
bench_regex_compilation currently only calls SemanticClassifier::new, which
doesn't force lazy_static regex compilation; either rename the benchmark to
reflect it's measuring the constructor (e.g., classification_constructor) or
change its body to trigger regex initialization explicitly (call a method that
uses the regex such as SemanticClassifier::classify or call the module's regex
initializer so lazy_static/once_cell runs) so the measured work matches the
benchmark name; apply the same change to the duplicate at lines 124-135 as well.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
docs/src/classification.md
Outdated
| let classifier = SemanticClassifier::new(); | ||
| let found_string = FoundString { | ||
| text: "C:\\Windows\\System32\\cmd.exe".to_string(), | ||
| let context = StringContext { | ||
| section_type: SectionType::StringData, | ||
| section_name: Some(".rodata".to_string()), | ||
| binary_format: BinaryFormat::Elf, |
Copilot AI (Jan 25, 2026)
This example constructs StringContext with a struct literal, but StringContext is #[non_exhaustive], so external users cannot instantiate it this way. Update the example to use StringContext::new(...) plus builder methods (e.g., with_section_name) so the docs compile and match the intended construction pattern.
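A hypothetical sketch of the construction pattern this comment suggests, with toy enums standing in for the crate's types (the real `StringContext` API and field set may differ):

```rust
// Toy stand-ins for the crate's enums; the real definitions live in
// stringy::types and carry more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SectionType {
    StringData,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BinaryFormat {
    Elf,
}

// Hypothetical #[non_exhaustive] context built via `new` plus a builder
// method, mirroring the shape the docs example should demonstrate.
#[non_exhaustive]
#[derive(Debug)]
struct StringContext {
    section_type: SectionType,
    binary_format: BinaryFormat,
    section_name: Option<String>,
}

impl StringContext {
    fn new(section_type: SectionType, binary_format: BinaryFormat) -> Self {
        Self {
            section_type,
            binary_format,
            section_name: None,
        }
    }

    fn with_section_name(mut self, name: impl Into<String>) -> Self {
        self.section_name = Some(name.into());
        self
    }
}

fn main() {
    // The docs example, rewritten without a struct literal:
    let context = StringContext::new(SectionType::StringData, BinaryFormat::Elf)
        .with_section_name(".rodata");
    assert_eq!(context.section_type, SectionType::StringData);
    assert_eq!(context.binary_format, BinaryFormat::Elf);
    assert_eq!(context.section_name.as_deref(), Some(".rodata"));
}
```

Because `#[non_exhaustive]` blocks struct literals outside the defining crate, the docs example would fail to compile for users as written; the constructor-plus-builder form is the one that works downstream.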
| pub mod patterns; | ||
| pub mod ranking; | ||
| pub mod semantic; | ||
| pub mod symbols; |
Copilot AI (Jan 25, 2026)
pub mod semantic was removed, which breaks users importing from stringy::classification::semantic::* (this module was previously public). If you want to preserve source compatibility, consider reintroducing a small semantic module that re-exports SemanticClassifier (and any still-supported helpers), and optionally deprecate it.
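A minimal sketch of such a shim, with a toy `classification` module standing in for the crate's: the old `semantic` path keeps resolving to the moved type while emitting a deprecation warning.

```rust
mod classification {
    #[derive(Debug, Default)]
    pub struct SemanticClassifier;

    // Compatibility shim: the old module path still works, but emits a
    // deprecation warning steering users to the new location.
    #[deprecated(note = "use `classification::SemanticClassifier` directly")]
    pub mod semantic {
        pub use super::SemanticClassifier;
    }
}

#[allow(deprecated)]
fn main() {
    // The old import path resolves to the same type as the new one.
    use classification::semantic::SemanticClassifier as OldPath;
    let via_old_path: classification::SemanticClassifier = OldPath;
    assert_eq!(format!("{:?}", via_old_path), "SemanticClassifier");
}
```

Since `pub use` re-exports the very same item, downstream code compiled against the old path continues to interoperate with code using the new one; only the warning changes.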
…ing and regex caching Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/classification/patterns/network.rs (1)
27-57: Domain detection misses newer TLDs due to incomplete allowlist.

`has_valid_tld` only accepts domains in COMMON_TLDS, causing false negatives for valid but unlisted TLDs (e.g., `.systems`, `.network`, `.photography`). For a binary analysis tool, this reduces detection coverage for command-and-control infrastructure and other malicious domains. Consider a fallback heuristic: if a TLD is 2-24 characters and all alphabetic, treat it as valid rather than rejecting it outright.

Suggested fallback
 if let Some(dot_pos) = domain.rfind('.') {
     let tld = &domain[dot_pos + 1..];
     let tld_lower = tld.to_lowercase();
-    COMMON_TLDS.contains(tld_lower.as_str())
+    if COMMON_TLDS.contains(tld_lower.as_str()) {
+        true
+    } else {
+        (2..=24).contains(&tld.len()) && tld.chars().all(|c| c.is_ascii_alphabetic())
+    }
 } else {
     false
 }
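Extracted as a standalone function, the suggested heuristic behaves like this (`COMMON_TLDS` here is a three-entry stand-in for the real allowlist in `classification/patterns/network.rs`):

```rust
// Tiny stand-in for the real TLD allowlist.
const COMMON_TLDS: [&str; 3] = ["com", "net", "org"];

fn has_valid_tld(domain: &str) -> bool {
    match domain.rfind('.') {
        Some(dot_pos) => {
            let tld = &domain[dot_pos + 1..];
            let tld_lower = tld.to_lowercase();
            if COMMON_TLDS.contains(&tld_lower.as_str()) {
                true
            } else {
                // Fallback: a plausible but unlisted TLD still passes if it
                // is 2-24 characters of purely alphabetic ASCII.
                (2..=24).contains(&tld.len()) && tld.chars().all(|c| c.is_ascii_alphabetic())
            }
        }
        None => false,
    }
}

fn main() {
    assert!(has_valid_tld("example.com")); // allowlisted
    assert!(has_valid_tld("c2.photography")); // unlisted but plausible
    assert!(!has_valid_tld("payload.b64")); // digit in TLD: rejected
    assert!(!has_valid_tld("localhost")); // no dot, no TLD
}
```

The length bound and alphabetic-only check keep most noise (version suffixes, file extensions with digits) out while letting newer generic TLDs through.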
🤖 Fix all issues with AI agents
In `@src/types/error.rs`:
- Around line 12-44: Replace the flat String-based ParseError variants and
manual From impls by adding dedicated error-typed variants on the StringyError
enum and deriving #[from] where appropriate so original error sources are
preserved: change ParseError(String) to ParseError(#[from] goblin::error::Error)
and add variants like PeliteError(#[from] pelite::Error) and
ResourceFindError(#[from] pelite::resources::FindError) (or similarly-named
variants) and remove the manual impl From<...> blocks for goblin and pelite;
keep other variants (EncodingError, ConfigError, MemoryMapError) as-is and keep
the Result<T> = std::result::Result<T, StringyError> alias.
In `@tests/classification_integration_tests.rs`:
- Around line 24-105: Update the three tests test_elf_string_classification,
test_pe_string_classification, and test_macho_string_classification to record
insta snapshots of the classifier results instead of only asserting individual
tags: after creating SemanticClassifier::new(), creating the context via
create_test_context(...), and calling classifier.classify(...), collect the
returned tag sets (for each input string like guid, email, format_string,
base64, user_agent) into a serializable structure (e.g., Vec or HashSet of Tag
names) and replace or augment the assert!(tags.contains(...)) checks with insta
snapshot assertions (e.g., insta::assert_yaml_snapshot!) to persist the expected
classification outputs for each fixture-driven test; keep at least one explicit
assert per test if desired but ensure snapshotting uses the classifier variable
and context used in each test (symbols: SemanticClassifier::new,
create_test_context, classifier.classify, and the test functions named above).
In `@tests/classification_tests.rs`:
- Around line 177-226: Extract repeated FoundString construction into a small
helper (e.g., make_found_string or build_found_string) used by
test_classify_found_string_backward_compatibility: the helper should accept
parameters used in the diffs (string content, Encoding, offset, length,
StringSource, optional section) and return a FoundString (created via
FoundString::new and .with_section when needed); replace the three inline
FoundString::new(...) constructions in the test with calls to this helper and
keep calls to classifier.classify_found_string(&found_...) unchanged to preserve
assertions for Tag::Guid, Tag::Email, and Tag::FormatString.
♻️ Duplicate comments (2)
tests/classification_integration_tests.rs (1)
148-170: Avoid hard wall-clock thresholds in unit tests.

The `elapsed < Duration::from_secs(2)` assertion is prone to CI variance. This was already flagged in prior review; consider moving it to benchmarks, marking as ignored, or loosening/removing the timing check.

src/classification/mod.rs (1)
75-77: Add #[non_exhaustive] to public SemanticClassifier.

Public structs should be #[non_exhaustive] for API stability. This was already flagged in prior review. As per coding guidelines.
Proposed change
-#[derive(Debug, Default)]
+#[non_exhaustive]
+#[derive(Debug, Default)]
 pub struct SemanticClassifier;
| #[error("Binary parsing error: {0}")] | ||
| ParseError(String), | ||
|
|
||
| #[error("Invalid encoding in string at offset {offset}")] | ||
| EncodingError { offset: u64 }, | ||
|
|
||
| #[error("Configuration error: {0}")] | ||
| ConfigError(String), | ||
|
|
||
| #[error("Memory mapping error: {0}")] | ||
| MemoryMapError(String), | ||
| } | ||
|
|
||
| /// Result type alias for the stringy library | ||
| pub type Result<T> = std::result::Result<T, StringyError>; | ||
|
|
||
| impl From<goblin::error::Error> for StringyError { | ||
| fn from(err: goblin::error::Error) -> Self { | ||
| StringyError::ParseError(err.to_string()) | ||
| } | ||
| } | ||
|
|
||
| impl From<pelite::Error> for StringyError { | ||
| fn from(err: pelite::Error) -> Self { | ||
| StringyError::ParseError(err.to_string()) | ||
| } | ||
| } | ||
|
|
||
| impl From<pelite::resources::FindError> for StringyError { | ||
| fn from(err: pelite::resources::FindError) -> Self { | ||
| StringyError::ParseError(format!("Resource lookup error: {}", err)) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick | 🔵 Trivial
Preserve parse error sources instead of flattening to String.
Converting goblin and pelite errors to String loses the error chain and makes debugging harder. Prefer dedicated variants with #[from] to preserve sources. Based on coding guidelines.
Proposed refactor to preserve sources
pub enum StringyError {
@@
- #[error("Binary parsing error: {0}")]
- ParseError(String),
+ #[error("Goblin parse error: {0}")]
+ GoblinError(#[from] goblin::error::Error),
+
+ #[error("PE parse error: {0}")]
+ PeliteError(#[from] pelite::Error),
+
+ #[error("PE resource lookup error: {0}")]
+ PeliteResourceError(#[from] pelite::resources::FindError),
@@
-impl From<goblin::error::Error> for StringyError {
- fn from(err: goblin::error::Error) -> Self {
- StringyError::ParseError(err.to_string())
- }
-}
-
-impl From<pelite::Error> for StringyError {
- fn from(err: pelite::Error) -> Self {
- StringyError::ParseError(err.to_string())
- }
-}
-
-impl From<pelite::resources::FindError> for StringyError {
- fn from(err: pelite::resources::FindError) -> Self {
- StringyError::ParseError(format!("Resource lookup error: {}", err))
- }
-}🤖 Prompt for AI Agents
In `@src/types/error.rs` around lines 12 - 44, Replace the flat String-based
ParseError variants and manual From impls by adding dedicated error-typed
variants on the StringyError enum and deriving #[from] where appropriate so
original error sources are preserved: change ParseError(String) to
ParseError(#[from] goblin::error::Error) and add variants like
PeliteError(#[from] pelite::Error) and ResourceFindError(#[from]
pelite::resources::FindError) (or similarly-named variants) and remove the
manual impl From<...> blocks for goblin and pelite; keep other variants
(EncodingError, ConfigError, MemoryMapError) as-is and keep the Result<T> =
std::result::Result<T, StringyError> alias.
| #[test] | ||
| fn test_elf_string_classification() { | ||
| let fixture_path = get_fixture_path("test_binary_elf"); | ||
| let elf_data = fs::read(&fixture_path) | ||
| .expect("Failed to read ELF fixture. Run the build script to generate fixtures."); | ||
|
|
||
| assert!(ElfParser::detect(&elf_data), "ELF detection should succeed"); | ||
| let parser = ElfParser::new(); | ||
| let container_info = parser.parse(&elf_data).expect("Failed to parse ELF"); | ||
|
|
||
| assert_eq!(container_info.format, BinaryFormat::Elf); | ||
|
|
||
| let classifier = SemanticClassifier::new(); | ||
| let context = create_test_context( | ||
| BinaryFormat::Elf, | ||
| SectionType::StringData, | ||
| StringSource::SectionData, | ||
| ); | ||
|
|
||
| let guid = "{12345678-1234-1234-1234-123456789abc}"; | ||
| let tags = classifier.classify(guid, &context); | ||
| assert!(tags.contains(&Tag::Guid)); | ||
|
|
||
| let email = "admin@malware.com"; | ||
| let tags = classifier.classify(email, &context); | ||
| assert!(tags.contains(&Tag::Email)); | ||
|
|
||
| let format_string = "Error: %s at line %d"; | ||
| let tags = classifier.classify(format_string, &context); | ||
| assert!(tags.contains(&Tag::FormatString)); | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_pe_string_classification() { | ||
| let fixture_path = get_fixture_path("test_binary_pe.exe"); | ||
| let pe_data = fs::read(&fixture_path) | ||
| .expect("Failed to read PE fixture. Run the build script to generate fixtures."); | ||
|
|
||
| assert!(PeParser::detect(&pe_data), "PE detection should succeed"); | ||
| let parser = PeParser::new(); | ||
| let container_info = parser.parse(&pe_data).expect("Failed to parse PE"); | ||
|
|
||
| assert_eq!(container_info.format, BinaryFormat::Pe); | ||
|
|
||
| let classifier = SemanticClassifier::new(); | ||
| let context = create_test_context( | ||
| BinaryFormat::Pe, | ||
| SectionType::Resources, | ||
| StringSource::ResourceString, | ||
| ); | ||
|
|
||
| let base64 = "U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw=="; | ||
| let tags = classifier.classify(base64, &context); | ||
| assert!(tags.contains(&Tag::Base64)); | ||
|
|
||
| let user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"; | ||
| let tags = classifier.classify(user_agent, &context); | ||
| assert!(tags.contains(&Tag::UserAgent)); | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_macho_string_classification() { | ||
| let fixture_path = get_fixture_path("test_binary_macho"); | ||
| let macho_data = fs::read(&fixture_path) | ||
| .expect("Failed to read Mach-O fixture. Run the build script to generate fixtures."); | ||
|
|
||
| assert!( | ||
| MachoParser::detect(&macho_data), | ||
| "Mach-O detection should succeed" | ||
| ); | ||
| let parser = MachoParser::new(); | ||
| let container_info = parser.parse(&macho_data).expect("Failed to parse Mach-O"); | ||
|
|
||
| assert_eq!(container_info.format, BinaryFormat::MachO); | ||
|
|
||
| let classifier = SemanticClassifier::new(); | ||
| let context = create_test_context( | ||
| BinaryFormat::MachO, | ||
| SectionType::StringData, | ||
| StringSource::SectionData, | ||
| ); | ||
|
|
🧩 Analysis chain
🏁 Script executed:
# Check if the test file exists and view its structure
fd "classification_integration_tests.rs" tests/
# Count lines to understand file size
wc -l tests/classification_integration_tests.rs 2>/dev/null || echo "File not found"
# Check for any insta usage in the test file
rg "insta" tests/classification_integration_tests.rs 2>/dev/null || echo "No insta found"
# Look for existing snapshot files
fd "\.snap" tests/ 2>/dev/null | head -20 || echo "No snapshots found"
# Check the test structure more broadly
rg "#\[test\]" tests/classification_integration_tests.rs -A 2 2>/dev/null || echo "No test macros found"

Repository: EvilBit-Labs/Stringy
Length of output: 2004
🏁 Script executed:
# Let me examine the test file to understand current structure
head -50 tests/classification_integration_tests.rs
# Check if there are any imports of insta in the file
rg "use insta" tests/classification_integration_tests.rs
# Check if the snapshot file is referenced or used
rg "classification_snapshots\|assert_debug_snapshot\|assert_snapshot" tests/classification_integration_tests.rs
# Look at the full test file to understand the structure
wc -l tests/classification_integration_tests.rs
cat tests/classification_integration_tests.rs

Repository: EvilBit-Labs/Stringy
Length of output: 7547
🏁 Script executed:
# Check the existing snapshot file to understand the pattern
cat tests/snapshots/classification_integration__classification_snapshots.snap
# Verify if other test files in the repository use insta snapshots
rg "assert_debug_snapshot\|assert_snapshot" tests/ -l | head -5
# Check one of the output integration tests to see the insta usage pattern
rg "assert_.*_snapshot" tests/output_json_integration_tests.rs -B 2 -A 2 | head -30

Repository: EvilBit-Labs/Stringy
Length of output: 733
🏁 Script executed:
# Check if there's a different classification_integration.rs file
fd "classification_integration\.rs" tests/
# Look for where the snapshot is actually used
rg "classification_snapshots" tests/
# Check git history or structure to understand the relationship
git log --oneline tests/classification_integration* 2>/dev/null | head -10
# Check if insta is even in the dependencies
rg "insta" Cargo.toml tests/Cargo.toml 2>/dev/null | head -20
# Look at the output_json test to see insta usage pattern
fd "output_json_integration" tests/ -type f

Repository: EvilBit-Labs/Stringy
Length of output: 491
🏁 Script executed:
# Look at the actual classification_integration.rs file that uses snapshots
cat tests/classification_integration.rs
# Compare file sizes
wc -l tests/classification_integration.rs tests/classification_integration_tests.rs
# Check imports in classification_integration.rs
head -20 tests/classification_integration.rs

Repository: EvilBit-Labs/Stringy
Length of output: 4011
Add insta snapshots for fixture-driven tests.
Tests loading binary fixtures must use insta snapshots per coding guidelines. The three ELF/PE/Mach-O format tests (lines 24-105) load fixtures from tests/fixtures/ but verify classifications only with assert!(tags.contains(...)). Snapshot the classified tag sets per format to capture expected behavior and prevent regressions.
Example refactor for ELF test
use std::fs;
use std::time::{Duration, Instant};
+use insta::assert_debug_snapshot;
use stringy::classification::SemanticClassifier;
use stringy::container::{ContainerParser, ElfParser, MachoParser, PeParser};
use stringy::types::{BinaryFormat, Encoding, SectionType, StringContext, StringSource, Tag};
@@
let guid = "{12345678-1234-1234-1234-123456789abc}";
let tags = classifier.classify(guid, &context);
assert!(tags.contains(&Tag::Guid));
+ assert_debug_snapshot!("elf_guid_tags", tags);
let email = "admin@malware.com";
let tags = classifier.classify(email, &context);
assert!(tags.contains(&Tag::Email));
+ assert_debug_snapshot!("elf_email_tags", tags);🤖 Prompt for AI Agents
In `@tests/classification_integration_tests.rs` around lines 24 - 105, Update the
three tests test_elf_string_classification, test_pe_string_classification, and
test_macho_string_classification to record insta snapshots of the classifier
results instead of only asserting individual tags: after creating
SemanticClassifier::new(), creating the context via create_test_context(...),
and calling classifier.classify(...), collect the returned tag sets (for each
input string like guid, email, format_string, base64, user_agent) into a
serializable structure (e.g., Vec or HashSet of Tag names) and replace or
augment the assert!(tags.contains(...)) checks with insta snapshot assertions
(e.g., insta::assert_yaml_snapshot!) to persist the expected classification
outputs for each fixture-driven test; keep at least one explicit assert per test
if desired but ensure snapshotting uses the classifier variable and context used
in each test (symbols: SemanticClassifier::new, create_test_context,
classifier.classify, and the test functions named above).
| #[test] | ||
| fn test_classify_found_string_backward_compatibility() { | ||
| let classifier = SemanticClassifier::new(); | ||
|
|
||
| // Test GUID classification via FoundString | ||
| let found_guid = FoundString::new( | ||
| "{12345678-1234-1234-1234-123456789abc}".to_string(), | ||
| Encoding::Ascii, | ||
| 0, | ||
| 38, | ||
| StringSource::SectionData, | ||
| ) | ||
| .with_section(".rodata".to_string()); | ||
|
|
||
| let tags = classifier.classify_found_string(&found_guid); | ||
| assert!( | ||
| tags.contains(&Tag::Guid), | ||
| "GUID should be detected via classify_found_string" | ||
| ); | ||
|
|
||
| // Test email classification via FoundString | ||
| let found_email = FoundString::new( | ||
| "admin@example.com".to_string(), | ||
| Encoding::Ascii, | ||
| 100, | ||
| 17, | ||
| StringSource::SectionData, | ||
| ); | ||
|
|
||
| let tags = classifier.classify_found_string(&found_email); | ||
| assert!( | ||
| tags.contains(&Tag::Email), | ||
| "Email should be detected via classify_found_string" | ||
| ); | ||
|
|
||
| // Test format string classification via FoundString | ||
| let found_format = FoundString::new( | ||
| "Error: %s at line %d".to_string(), | ||
| Encoding::Ascii, | ||
| 200, | ||
| 20, | ||
| StringSource::SectionData, | ||
| ); | ||
|
|
||
| let tags = classifier.classify_found_string(&found_format); | ||
| assert!( | ||
| tags.contains(&Tag::FormatString), | ||
| "Format string should be detected via classify_found_string" | ||
| ); | ||
| } |
🧹 Nitpick | 🔵 Trivial
Factor repeated FoundString construction into a helper.
Multiple tests hand-build FoundString instances. A small helper reduces duplication and keeps length/offset usage consistent.
Example helper
use stringy::types::{
BinaryFormat, Encoding, FoundString, SectionType, StringContext, StringSource, Tag,
};
@@
+fn make_found_string(text: &str, offset: u64, length: u32) -> FoundString {
+ FoundString::new(
+ text.to_string(),
+ Encoding::Ascii,
+ offset,
+ length,
+ StringSource::SectionData,
+ )
+}
@@
- let found_guid = FoundString::new(
- "{12345678-1234-1234-1234-123456789abc}".to_string(),
- Encoding::Ascii,
- 0,
- 38,
- StringSource::SectionData,
- )
- .with_section(".rodata".to_string());
+ let found_guid = make_found_string(
+ "{12345678-1234-1234-1234-123456789abc}",
+ 0,
+ 38,
+ )
+ .with_section(".rodata".to_string());🤖 Prompt for AI Agents
In `@tests/classification_tests.rs` around lines 177 - 226, Extract repeated
FoundString construction into a small helper (e.g., make_found_string or
build_found_string) used by test_classify_found_string_backward_compatibility:
the helper should accept parameters used in the diffs (string content, Encoding,
offset, length, StringSource, optional section) and return a FoundString
(created via FoundString::new and .with_section when needed); replace the three
inline FoundString::new(...) constructions in the test with calls to this helper
and keep calls to classifier.classify_found_string(&found_...) unchanged to
preserve assertions for Tag::Guid, Tag::Email, and Tag::FormatString.
…terns, and file structure Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Add missing `command: zigbuild` for cross-compilation in goreleaser
- Remove `rustup default stable` hook (toolchain pinned via rust-toolchain.toml)
- Replace lazy_static with once_cell::sync::Lazy for consistency
- Remove lazy_static dependency from Cargo.toml
- Fix documentation to use StringContext::new() constructor pattern
- Add missing semantic categories (URLs, domains, IPs, paths) to docs
- Remove flaky wall-clock timing assertion from integration tests
- Rename misleading benchmark from regex_compilation to classifier_construction
- Remove duplicate mdbook definition and pin just version in mise.toml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@AGENTS.md`:
- Around line 88-91: Replace the bold label "**New semantic tag**:" with a
proper Markdown heading (e.g., "### New semantic tag") or a list item to satisfy
MD036; keep the rest of the sentence ("Add variant to `Tag` enum in
`types/mod.rs`, implement pattern in `classification/patterns/` or
`classification/mod.rs`") unchanged and ensure the heading level aligns with
surrounding headings in AGENTS.md so references to the Tag enum (`Tag`),
types/mod.rs, and classification/patterns/ or classification/mod.rs remain
clear.
In `@benches/classification.rs`:
- Around line 84-102: In bench_batch_classification replace the mixed-format
string "Error %s at line {}" so it matches the classifier's documented patterns
(use % specifiers or numeric {digits}); update the format! call that populates
samples (inside the loop that builds samples) to use a numeric placeholder like
"Error %s at line {0}" (still passing index), so the sample vector contains the
intended %-style pattern and the classifier.classify calls
(classifier.classify(..., &context)) will see the correct mix of formats.
In `@docs/src/classification.md`:
- Around line 13-17: The inline code span for the URL regex under the "### URLs"
section contains an unescaped backtick and a stray trailing backslash which
breaks rendering; update the code span for the pattern (the string starting with
`https?://[^\s<>"{}|\\\^\[\]\`\]+\`) to use a double-backtick code span and
remove the trailing backslash so the backtick inside the character class is
preserved and the pattern renders correctly.
♻️ Duplicate comments (7)
.goreleaser.yaml (1)
11-16: Pin cargo-zigbuild in the pre-build hook.
`cargo install` without `--version` makes release builds non-deterministic and can drift from the toolchain pinned in `mise.toml`. Align the hook to a fixed version.

Proposed fix

- - cargo install --locked cargo-zigbuild
+ - cargo install --locked --version 0.21.2 cargo-zigbuild

#!/bin/bash
# Confirm the pinned cargo-zigbuild version in mise.toml
rg -n '"cargo:cargo-zigbuild"' mise.toml

mise.toml (2)
9-9: Use stable cargo-nextest instead of a pre-release.
The pre-release pin increases instability and may not be published. Prefer the stable release.

Proposed fix

-"cargo:cargo-nextest" = "0.9.123-b.4"
+"cargo:cargo-nextest" = "0.9.123"

What is the latest stable cargo-nextest version, and does 0.9.123-b.4 exist?
18-18: Pin claude to a fixed version for reproducible toolchains.
Using "latest" makes local and CI environments drift over time.

Proposed fix

```diff
-claude = "latest"
+claude = "X.Y.Z"
```

What is the latest stable version string for the claude CLI (mise tool)?

src/classification/mod.rs (2)
40-76: Use lazy_static! for regex caching in classification modules.
This module uses `once_cell::sync::Lazy`, but the project rules require `lazy_static!` for regex caches. Convert the statics and ensure `lazy_static` is present in Cargo.toml (remove `once_cell` here if no longer needed). As per coding guidelines, align classification regex caching to lazy_static.

Proposed fix

```diff
-use once_cell::sync::Lazy;
+use lazy_static::lazy_static;
 use regex::Regex;

-static GUID_REGEX: Lazy<Regex> = Lazy::new(|| {
-    Regex::new(r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$")
-        .expect("Invalid GUID regex")
-});
-
-static EMAIL_REGEX: Lazy<Regex> = Lazy::new(|| {
-    Regex::new(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$").expect("Invalid email regex")
-});
-
-static BASE64_REGEX: Lazy<Regex> =
-    Lazy::new(|| Regex::new(r"^[A-Za-z0-9+/]{20,}={0,2}$").expect("Invalid base64 regex"));
-
-static FORMAT_REGEX: Lazy<Regex> =
-    Lazy::new(|| Regex::new(r"%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}").expect("Invalid format regex"));
-
-static USER_AGENT_REGEX: Lazy<Regex> = Lazy::new(|| {
-    Regex::new(r"(Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+)")
-        .expect("Invalid user agent regex")
-});
+lazy_static! {
+    static ref GUID_REGEX: Regex = Regex::new(
+        r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$"
+    )
+    .expect("Invalid GUID regex");
+
+    static ref EMAIL_REGEX: Regex =
+        Regex::new(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
+            .expect("Invalid email regex");
+
+    static ref BASE64_REGEX: Regex =
+        Regex::new(r"^[A-Za-z0-9+/]{20,}={0,2}$").expect("Invalid base64 regex");
+
+    static ref FORMAT_REGEX: Regex =
+        Regex::new(r"%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}").expect("Invalid format regex");
+
+    static ref USER_AGENT_REGEX: Regex = Regex::new(
+        r"(Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+)"
+    )
+    .expect("Invalid user agent regex");
+}
```

```bash
#!/bin/bash
# Identify lazy_static vs once_cell usage in classification modules
rg -n "lazy_static!|once_cell::sync::Lazy" src/classification
```
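For intuition about what the GUID regex enforces (full-string anchoring, braces, and fixed-length hex groups), here is a hand-rolled, std-only sketch of an equivalent check. It is illustrative only; the actual implementation uses the `regex` crate as shown above, and `looks_like_guid` is a hypothetical helper name.

```rust
/// Illustrative stand-in for GUID_REGEX:
/// ^\{hex{8}-hex{4}-hex{4}-hex{4}-hex{12}\}$
fn looks_like_guid(s: &str) -> bool {
    let bytes = s.as_bytes();
    // Two braces + 32 hex digits + 4 hyphens = 38 bytes, fully anchored.
    if bytes.len() != 38 || bytes[0] != b'{' || bytes[37] != b'}' {
        return false;
    }
    let groups: Vec<&str> = s[1..37].split('-').collect();
    let lens = [8, 4, 4, 4, 12];
    groups.len() == 5
        && groups
            .iter()
            .zip(lens)
            .all(|(g, n)| g.len() == n && g.bytes().all(|b| b.is_ascii_hexdigit()))
}

fn main() {
    assert!(looks_like_guid("{12345678-1234-1234-1234-123456789abc}"));
    assert!(!looks_like_guid("12345678-1234-1234-1234-123456789abc")); // unbraced
    assert!(!looks_like_guid("{12345678-1234-1234-1234-12345678zzzz}")); // non-hex
    println!("ok");
}
```

Because the pattern is anchored at both ends, a GUID embedded in a longer string is intentionally not tagged; that matches the classifier's whole-string validation style.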
78-80: Mark SemanticClassifier as #[non_exhaustive].
This is a public struct; add `#[non_exhaustive]` to keep the API forward-compatible. As per coding guidelines, public structs should be non-exhaustive.

Proposed fix

```diff
+#[non_exhaustive]
 #[derive(Debug, Default)]
 pub struct SemanticClassifier;
```

tests/classification_integration_tests.rs (1)
23-112: Add insta snapshots for fixture-driven classification tests.
These tests load fixtures but only assert individual tags. The test guidelines require snapshot coverage for fixture-driven outputs to prevent regressions. As per coding guidelines, add `insta` snapshots for the tag sets.

Proposed fix (example for ELF test)

```diff
+use insta::assert_debug_snapshot;
@@
     let guid = "{12345678-1234-1234-1234-123456789abc}";
     let tags = classifier.classify(guid, &context);
     assert!(tags.contains(&Tag::Guid));
+    assert_debug_snapshot!("elf_guid_tags", tags);
```

docs/src/classification.md (1)
124-133: Fix integration test path reference.

The integration test file is `tests/classification_integration.rs`, not `tests/classification_integration_tests.rs`.

Proposed fix

```diff
-- Integration tests: `tests/classification_integration_tests.rs`
+- Integration tests: `tests/classification_integration.rs`
```
```diff
 ## Adding Features

-**New semantic tag**: Add variant to `Tag` enum in `types.rs`, implement pattern in `classification/semantic.rs`
+**New semantic tag**: Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
```
Fix MD036 by using a heading instead of bold text.
Markdownlint reports emphasis used as a heading. Convert the bold label to a heading or list item.
Proposed fix
```diff
-**New semantic tag**: Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
+### New semantic tag
+Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 ## Adding Features

-**New semantic tag**: Add variant to `Tag` enum in `types.rs`, implement pattern in `classification/semantic.rs`
-**New semantic tag**: Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
+### New semantic tag
+Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
```
🤖 Prompt for AI Agents
In `@AGENTS.md` around lines 88 - 91, Replace the bold label "**New semantic
tag**:" with a proper Markdown heading (e.g., "### New semantic tag") or a list
item to satisfy MD036; keep the rest of the sentence ("Add variant to `Tag` enum
in `types/mod.rs`, implement pattern in `classification/patterns/` or
`classification/mod.rs`") unchanged and ensure the heading level aligns with
surrounding headings in AGENTS.md so references to the Tag enum (`Tag`),
types/mod.rs, and classification/patterns/ or classification/mod.rs remain
clear.
```rust
fn bench_batch_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();

    let mut samples = Vec::new();
    for index in 0..1000 {
        samples.push(format!("{{12345678-1234-1234-1234-{:012x}}}", index));
        samples.push(format!("user{}@example.com", index));
        samples.push(format!("Error %s at line {}", index));
    }

    c.bench_function("classification_batch", |b| {
        b.iter(|| {
            for sample in &samples {
                let _ = classifier.classify(black_box(sample.as_str()), &context);
            }
        });
    });
}
```
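As a quick sanity check on the sample generator in the quoted benchmark: in `format!`, `{{`/`}}` emit literal braces and `{:012x}` zero-pads the index to 12 lowercase hex digits, so each GUID sample is exactly brace-wrapped. A minimal round-trip (standalone sketch, not part of the bench):

```rust
fn main() {
    let index = 255;
    // {{ and }} escape to literal braces; {:012x} pads to 12 hex digits.
    let guid_sample = format!("{{12345678-1234-1234-1234-{:012x}}}", index);
    assert_eq!(guid_sample, "{12345678-1234-1234-1234-0000000000ff}");

    let email_sample = format!("user{}@example.com", index);
    assert_eq!(email_sample, "user255@example.com");
}
```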
Batch sample misses a format-string match.
"Error %s at line {}" uses {} which does not match the documented format-string patterns that rely on % specifiers or {digits}. This reduces the intended mix of categories in the batch set.
Proposed fix
```diff
-        samples.push(format!("Error %s at line {}", index));
+        samples.push(format!("Error %s at line %d {}", index));
```

📝 Committable suggestion
```diff
 fn bench_batch_classification(c: &mut Criterion) {
     let classifier = SemanticClassifier::new();
     let context = make_context();

     let mut samples = Vec::new();
     for index in 0..1000 {
         samples.push(format!("{{12345678-1234-1234-1234-{:012x}}}", index));
         samples.push(format!("user{}@example.com", index));
-        samples.push(format!("Error %s at line {}", index));
+        samples.push(format!("Error %s at line %d {}", index));
     }

     c.bench_function("classification_batch", |b| {
         b.iter(|| {
             for sample in &samples {
                 let _ = classifier.classify(black_box(sample.as_str()), &context);
             }
         });
     });
 }
```
🤖 Prompt for AI Agents
In `@benches/classification.rs` around lines 84 - 102, In
bench_batch_classification replace the mixed-format string "Error %s at line {}"
so it matches the classifier's documented patterns (use % specifiers or numeric
{digits}); update the format! call that populates samples (inside the loop that
builds samples) to use a numeric placeholder like "Error %s at line {0}" (still
passing index), so the sample vector contains the intended %-style pattern and
the classifier.classify calls (classifier.classify(..., &context)) will see the
correct mix of formats.
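To make the disputed behavior concrete, here is a std-only sketch of the semantics the reviewed pattern `%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}` implies, hand-rolled for illustration (the crate itself uses the `regex` crate, and `has_format_placeholder` is a hypothetical name). It shows why a bare `{}` never matches while `%s` and `{0}` do:

```rust
// Mirrors: %spec, %<digits>spec, and {digits}; a bare {} is rejected.
fn has_format_placeholder(s: &str) -> bool {
    let bytes = s.as_bytes();
    let specs = b"sdxofcpn";
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'%' => {
                // Skip an optional width like %10s before the specifier.
                let mut j = i + 1;
                while j < bytes.len() && bytes[j].is_ascii_digit() {
                    j += 1;
                }
                if j < bytes.len() && specs.contains(&bytes[j]) {
                    return true;
                }
            }
            b'{' => {
                // {digits} such as {0}; requires at least one digit.
                let mut j = i + 1;
                while j < bytes.len() && bytes[j].is_ascii_digit() {
                    j += 1;
                }
                if j > i + 1 && j < bytes.len() && bytes[j] == b'}' {
                    return true;
                }
            }
            _ => {}
        }
        i += 1;
    }
    false
}

fn main() {
    assert!(has_format_placeholder("Error %s at line {}")); // matches via %s only
    assert!(has_format_placeholder("value: {0}"));
    assert!(!has_format_placeholder("just braces {} here"));
    println!("ok");
}
```

So the benchmark string already classifies as a format string through `%s`; the suggested `%d` merely makes the sample exercise two placeholder forms instead of one.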
```markdown
### URLs

#### URLs
- Pattern: `https?://[^\s<>"{}|\\\^\[\]\`\]+\`
- Examples: `https://example.com/path`, `http://malware.site/payload`
- Validation: Must start with `http://` or `https://`
```
URL pattern code span breaks on backtick.
The inline code span contains a backtick in the character class, which terminates the code span and renders incorrectly. Use a double-backtick code span and remove the stray trailing backslash.
Proposed fix
```diff
-- Pattern: `https?://[^\s<>"{}|\\\^\[\]\`\]+\`
+- Pattern: ``https?://[^\s<>"{}|\\\^\[\]\`]+``
```

📝 Committable suggestion
```diff
 ### URLs

-#### URLs
-- Pattern: `https?://[^\s<>"{}|\\\^\[\]\`\]+\`
+- Pattern: ``https?://[^\s<>"{}|\\\^\[\]\`]+``
 - Examples: `https://example.com/path`, `http://malware.site/payload`
 - Validation: Must start with `http://` or `https://`
```
🤖 Prompt for AI Agents
In `@docs/src/classification.md` around lines 13 - 17, The inline code span for
the URL regex under the "### URLs" section contains an unescaped backtick and a
stray trailing backslash which breaks rendering; update the code span for the
pattern (the string starting with `https?://[^\s<>"{}|\\\^\[\]\`\]+\`) to use a
double-backtick code span and remove the trailing backslash so the backtick
inside the character class is preserved and the pattern renders correctly.
This pull request introduces a new benchmarking suite for the string classification system, updates documentation to reflect recent classifier changes, and improves build, dependency, and release tooling. The most notable changes are the addition of classification benchmarks, a major rewrite of the classification documentation, and enhanced GoReleaser and toolchain configuration.
Classification System Improvements:
- Added a benchmark suite in `benches/classification.rs` to measure the performance of various string classification tasks, including GUIDs, emails, base64, format strings, user agents, batch classification, and worst-case scenarios. [1] [2]

Documentation Updates:

- Rewrote `docs/src/classification.md` to document the new classification pipeline, validation rules, semantic categories, performance notes, and testing instructions. The documentation now matches the current implementation and usage patterns.

Build & Toolchain Enhancements:

- Updated the `.goreleaser.yaml` configuration for cross-platform Rust builds, including pre-build hooks for Rust toolchain setup and cross-compilation with `cargo-zigbuild`.
- Consolidated the `justfile` GoReleaser and lint tasks, removing platform-specific duplication and switching to `markdownlint-cli2`. [1] [2]
- Extended the `mise.toml` toolchain with additional Rust and documentation tools, and updated several tool versions for improved reproducibility. [1] [2]

Internal Code Cleanup:

- Removed helpers from `src/classification/patterns/mod.rs` that are no longer needed for classifier cache testing.