
Conversation

@unclesp1d3r
Member

This pull request introduces a new benchmarking suite for the string classification system, updates documentation to reflect recent classifier changes, and improves build, dependency, and release tooling. The most notable changes are the addition of classification benchmarks, a major rewrite of the classification documentation, and enhanced GoReleaser and toolchain configuration.

Classification System Improvements:

  • Added a new benchmark suite benches/classification.rs to measure the performance of various string classification tasks, including GUIDs, emails, base64, format strings, user agents, batch classification, and worst-case scenarios.
  • Updated the domain classifier to avoid double-tagging domains that are actually email addresses.

Documentation Updates:

  • Completely rewrote docs/src/classification.md to document the new classification pipeline, validation rules, semantic categories, performance notes, and testing instructions. The documentation now matches the current implementation and usage patterns.

Build & Toolchain Enhancements:

  • Added a .goreleaser.yaml configuration for cross-platform Rust builds, including pre-build hooks for Rust toolchain setup and cross-compilation with cargo-zigbuild.
  • Modernized the justfile GoReleaser and lint tasks, removing platform-specific duplication and switching to markdownlint-cli2.
  • Expanded the mise.toml toolchain with additional Rust and documentation tools, and updated several tool versions for improved reproducibility.

Internal Code Cleanup:

  • Removed unused regex re-exports from src/classification/patterns/mod.rs that are no longer needed for classifier cache testing.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Updated the `SemanticClassifier` to utilize a new `StringContext` struct for context-aware classification.
- Enhanced classification capabilities for GUIDs, emails, Base64 strings, format strings, and user agents.
- Improved regex patterns and validation logic to reduce false positives.
- Consolidated path detection logic and removed redundant regex patterns.
- Updated tests to cover new classification scenarios and ensure performance benchmarks.
- Refactored integration tests to validate classification against various binary formats (ELF, PE, Mach-O).
- Removed deprecated methods and cleaned up code for better readability and maintainability.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@unclesp1d3r unclesp1d3r self-assigned this Jan 25, 2026
Copilot AI review requested due to automatic review settings January 25, 2026 06:07
@coderabbitai
Contributor

coderabbitai bot commented Jan 25, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Architectural & API Refactoring

This PR restructures the semantic classification system into a context-aware model and consolidates classification logic into src/classification/mod.rs (deleting the old semantic.rs). The classifier API moves away from a FoundString wrapper and toward a lightweight context-driven signature: SemanticClassifier::classify(&self, text: &str, context: &StringContext) plus a backwards-compatible classify_found_string(&self, &FoundString). apply_semantic_enrichment now accepts &ContainerInfo to construct and pass StringContext to the classifier.

Key Public API Changes

  • Added StringContext (non_exhaustive) with: section_type, optional section_name, binary_format, encoding, source; constructor StringContext::new(...) and with_section_name(...)
  • SemanticClassifier is now the primary entry in src/classification/mod.rs (a usage sketch follows this list) with:
    • pub fn classify(&self, text: &str, context: &StringContext) -> Vec<Tag>
    • pub fn classify_found_string(&self, found: &crate::types::FoundString) -> Vec<Tag>
    • pub fn regex_cache_addresses(&self) -> RegexCacheAddresses (for testing)
  • New (doc(hidden) pub) RegexCacheAddresses to expose internal regex cache addresses for tests
  • src/types/mod.rs exposes StringContext and re-exports it from lib.rs
  • Removed semantic.rs (old classifier implementation) and many of its public re-exports
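
A minimal sketch of the new call shape. Import paths, the constructor's parameter order, and the StringSource variant are assumptions; the walkthrough only guarantees that StringContext::new(...) and with_section_name(...) exist:

```rust
use stringy::classification::SemanticClassifier;
use stringy::types::{BinaryFormat, Encoding, SectionType, StringContext, StringSource, Tag};

fn classify_one() {
    let classifier = SemanticClassifier::new();
    // StringContext is #[non_exhaustive], so external callers go through the
    // constructor rather than a struct literal. Parameter order is assumed here.
    let context = StringContext::new(
        SectionType::StringData,
        BinaryFormat::Elf,
        Encoding::Ascii,
        StringSource::Section, // hypothetical variant name
    )
    .with_section_name(".rodata".to_string());

    let tags = classifier.classify("{12345678-1234-1234-1234-123456789abc}", &context);
    assert!(tags.contains(&Tag::Guid));
}
```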

Binary Format & Section Handling

  • Classification decisions are now context-aware: section_type and binary_format (ELF/PE/Mach-O) are used to adjust validation/boosting rules (e.g., .rodata vs other sections).
  • apply_semantic_enrichment constructs a section_map from ContainerInfo, determines SectionType for each FoundString, builds a StringContext (attaching section_name when available), and passes the text + context to the classifier.
  • This enables more precise detection (e.g., GUIDs, format strings) when the binary section indicates string literal regions; a rough sketch of the boosting idea follows.
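
The summary does not spell out the boosting mechanics; the sketch below only illustrates the idea. The section names and the numeric factor are invented for illustration, not taken from the diff:

```rust
// Illustrative only; the real rules live in src/classification/mod.rs.
fn literal_section_boost(context: &StringContext) -> f64 {
    match context.section_name.as_deref() {
        // Matches inside known string-literal regions are more credible.
        Some(".rodata") | Some(".rdata") | Some("__cstring") => 1.25, // assumed factor
        _ => 1.0,
    }
}
```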

Semantic Classification Accuracy & Validation

  • New semantic pattern matching for GUID, Email, Base64, Format String, and User-Agent with explicit validation helpers (is_valid_guid, is_valid_email, is_valid_base64, is_valid_format_string, is_valid_user_agent).
  • Classification pipeline now separates pattern matching from validation, includes context-aware confidence boosting, length checks, entropy heuristics, and TLD validation for domains.
  • classify_domain now avoids double-tagging email-like inputs by early-exiting if the input contains '@' (sketched after this list).
  • Added Tag::Guid; the semantic categories Guid, Email, Base64, FormatString, and UserAgent are now explicitly documented and used, while URL/Domain/IP/FilePath/Registry classification is retained (with enhanced validation).
  • Backwards-compatible classify_found_string helper preserves older callsites where needed.
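
The early-exit described above amounts to something like the following; the return type and the elided validation are assumptions based on the tests quoted later in this thread:

```rust
fn classify_domain(text: &str) -> Option<Tag> {
    // Leave email-like inputs to the email classifier so that
    // "user@example.com" is not double-tagged as a domain.
    if text.contains('@') {
        return None;
    }
    // ... domain regex match, TLD validation, etc. (elided) ...
    None
}
```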

Performance & Reliability

  • Introduces lazy, once_cell-based regex caching for GUID, Email, Base64, FormatString, and UserAgent, exposed via regex_cache_addresses for testing (see the sketch after this list).
  • Replaced unwrap() with explicit expect(...) messages across regex initializers (data.rs, ip.rs, network.rs, paths.rs) for clearer panic messages on invalid regexes.
  • New Criterion benchmarking suite benches/classification.rs covering classifier construction, per-pattern workloads (GUID, email, base64, format strings, user agents), batch classification (1000 samples), worst-case inputs, and context creation overhead.
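
Note the walkthrough says once_cell while the diff quoted later in this thread uses lazy_static. In the once_cell style used by the existing pattern modules, the cached GUID regex would look roughly like this (pattern taken from the quoted diff):

```rust
use once_cell::sync::Lazy;
use regex::Regex;

// Compiled once on first access; every later call reuses the cached Regex.
static GUID_REGEX: Lazy<Regex> = Lazy::new(|| {
    Regex::new(
        r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$",
    )
    .expect("Invalid GUID regex")
});
```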

Tooling, Docs & Build

  • Docs: Full rewrite of docs/src/classification.md to describe the new pipeline, validation rules, semantic categories, performance notes, and testing instructions.
  • Release/build: Added .goreleaser.yaml with pre-build hooks and cargo-zigbuild cross-compilation; updated justfile to reduce duplication and use markdownlint-cli2.
  • Development tooling: mise.toml expanded with numerous cargo:* and other tool pins (cargo-audit, cargo-deny, cargo-zigbuild, mdbook plugins, etc.).
  • Tests: New and updated tests and integration suites (tests/classification_tests.rs and tests/classification_integration_tests.rs) exercising GUID/Email/Base64/FormatString/UserAgent detection and context-sensitive behavior; existing tests updated to use StringContext.

Removals & Breaking Changes

  • Deleted src/classification/semantic.rs and removed older FoundString-first classifier implementation.
  • Classifier signature changed (breaking for callers): classify(&FoundString) → classify(&str, &StringContext). apply_semantic_enrichment signature updated to accept &ContainerInfo.
  • Patterns module: removed certain regex re-exports from src/classification/patterns/mod.rs (no longer publicly re-exported).
  • StringyError and Result moved into a dedicated src/types/error.rs (new error enum and From impls); types/mod.rs removed previous error definitions.
  • Consumers must construct and pass StringContext when calling classify; tests and extraction call sites updated accordingly.

Notes on Review Focus

  • Architectural impact: consolidation of classifier surface and movement to context-aware API; ensure no regressions from deleting semantic.rs and that classify_found_string remains compatible where needed.
  • API compatibility: callers must be updated to construct StringContext; confirm public re-exports in src/lib.rs expose the new types as intended.
  • Binary/section semantics: verify section_type mapping from ContainerInfo is correct for all extractor code paths and that boosting/validation heuristics are guarded to avoid false positives.
  • Classification correctness: review new validation logic (length, entropy, TLD checks) and regex patterns for GUID/email/base64/format/user-agent to ensure sensitivity/specificity trade-offs are acceptable.

Walkthrough

This PR rewrites and consolidates the string classification subsystem into a context-aware API. The classifier now accepts (text: &str, context: &StringContext), regex initializers use explicit expect messages, many pattern modules and tests are reorganized, and build/tooling/benchmarking infra (goreleaser, benches, mise, justfile) is added or updated.

Changes

  • Core classification (src/classification/mod.rs, src/classification/semantic.rs): semantic.rs removed; classification logic consolidated into mod.rs. New public SemanticClassifier with fn classify(&self, text: &str, context: &StringContext) -> Vec<Tag> plus a backward-compatible classify_found_string. Adds a regex_cache_addresses() accessor and RegexCacheAddresses (doc(hidden)).
  • Patterns & regexes (src/classification/patterns/*, e.g. mod.rs, network.rs, paths.rs, data.rs, ip.rs): Removed some re-exports of regex constants; replaced .unwrap() with .expect("Invalid ... regex") on Lazy regex inits; the network.rs domain classifier early-exits on email-like input and uses expect; paths.rs drops the REGISTRY_PATH_REGEX re-export. New/updated GUID/EMAIL/BASE64/FORMAT/USER_AGENT regexes live under patterns.
  • Types / context / errors (src/types/mod.rs, src/types/error.rs, src/types/tests.rs): New public StringContext (section_type, section_name, binary_format, encoding, source) with new() and with_section_name(). StringyError and the Result<T> alias moved to a new src/types/error.rs module. Added FoundString serde unit tests.
  • Extraction integration (src/extraction/mod.rs, src/lib.rs): apply_semantic_enrichment signature changed to accept &ContainerInfo; builds a section_map from container segments and constructs a StringContext per FoundString; StringContext re-exported from the crate root.
  • Build & benchmark infra (.goreleaser.yaml, Cargo.toml, benches/classification.rs): Adds .goreleaser.yaml (v2) with zig-based cross-builds and packaging rules; updates dependencies (entropy, thiserror); adds a classification bench target and introduces benches/classification.rs with multiple Criterion benchmarks.
  • Tooling / task runner (justfile, mise.toml): justfile swaps markdownlint for markdownlint-cli2, removes several goreleaser recipes, and adds an annotated goreleaser-build-target arg. mise.toml reorganizes tools under cargo:* and adds many dev tools (cargo-audit, cargo-deny, cargo-zigbuild, mdbook plugins, zig, lychee, pipx:mdformat).
  • Tests & docs (docs/src/classification.md, tests/classification*.rs, tests/classification_integration*.rs): Docs refocused toward GUID-centric examples and a validation step; tests updated and added to use StringContext and the new classification API. New unit and integration test suites cover GUID, Email, Base64, FormatString, UserAgent, and path/URL cases plus batch/perf checks.

Sequence Diagram(s)

(omitted — changes are internal library refactors and tests; no multi-actor runtime sequence diagram applicable)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🧩 Context arrives, classifiers awake,
Regexes hardened, expectations take stake,
GUIDs and emails now seen with a view,
Benchmarks hum, toolchains refreshed too —
Small refactors, big clarity; ship it true.

🚥 Pre-merge checks | ✅ 6 | ❌ 2
❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Error Handling | ⚠️ Warning | StringyError is defined but never used; functions panic via .expect() instead of returning Result<T, StringyError>, violating the stated error handling conventions. | Update all fallible operations to return Result<T, StringyError> and replace .expect() calls with proper error propagation using the ? operator throughout the classification and extraction modules. |
| Ascii Only | ⚠️ Warning | The PR introduces non-ASCII emojis (✅, 🚧) in src/lib.rs documentation comments, violating the ASCII-only requirement stated in AGENTS.md. | Remove the emoji status indicators from src/lib.rs lines 57-67 and replace them with ASCII equivalents such as [OK] or [WIP]. |

✅ Passed checks (6 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title follows the Conventional Commits format with a feat type and classification scope, clearly describing the semantic pattern matching features being added. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering classification improvements, documentation updates, build enhancements, and internal cleanup. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 92.86%, which is sufficient. The required threshold is 70.00%. |
| No Unsafe Code | ✅ Passed | No unsafe code found. The crate maintains #![forbid(unsafe_code)] at the lib.rs top level, with zero unsafe blocks or #[allow(unsafe_code)] attributes introduced. |
| File Size Limit | ✅ Passed | All 8 new files comply with the 500-line limit (max 247 lines), no blanket #[allow] attributes were found, and the pre-existing oversized file was minimally modified (+33/-5). |
| Section Weight Consistency | ✅ Passed | The custom check for section weight consistency does not apply; no container parser files were modified. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches

  • 📝 Generate docstrings

🧪 Generate unit tests (beta)

  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch 18-implement-semantic-pattern-matching-guid-email-base64-printf-format-and-user-agent-detection

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels Jan 25, 2026
@codecov
Copy link

codecov bot commented Jan 25, 2026

Codecov Report

❌ Patch coverage is 92.33227% with 24 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/classification/mod.rs | 93.67% | 15 Missing ⚠️ |
| src/types/error.rs | 0.00% | 9 Missing ⚠️ |


Copilot AI left a comment

Pull request overview

This PR expands Stringy’s semantic classification capabilities and tooling by introducing context-aware classification (via StringContext), adding new tag detectors (GUID/email/base64/format strings/user agents), and improving benchmarking/testing plus release/toolchain configs.

Changes:

  • Added StringContext and rewired semantic enrichment/classification to be context-aware across extraction and tests.
  • Implemented additional semantic detectors (GUID/email/base64/format strings/user agents) and adjusted domain tagging to avoid email double-tagging.
  • Added Criterion benchmarks and updated build/release/toolchain/docs to reflect the new classification pipeline.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 10 comments.

Show a summary per file

| File | Description |
| --- | --- |
| src/types.rs | Adds StringContext to carry classification context metadata. |
| src/lib.rs | Re-exports StringContext as part of the public API. |
| src/extraction/mod.rs | Passes container-derived context into semantic enrichment/classification. |
| src/classification/mod.rs | Reworks SemanticClassifier into a context-aware classifier and adds new detectors. |
| src/classification/patterns/network.rs | Prevents domain tagging when input looks like an email address. |
| src/classification/patterns/mod.rs | Removes regex re-exports no longer needed for cache testing. |
| src/classification/patterns/paths.rs | Removes REGISTRY_PATH_REGEX and relies on validation logic for registry detection. |
| src/classification/semantic.rs | Updates imports, but now appears out-of-sync with the active classifier/module tree. |
| tests/classification_tests.rs | Adds unit tests for new detectors and context-aware behavior. |
| tests/classification_integration_tests.rs | Adds integration coverage across ELF/PE/Mach-O fixtures plus perf/cache checks. |
| tests/classification_integration.rs | Updates snapshot/perf tests to match the new classifier API and tags. |
| tests/snapshots/classification_integration__classification_snapshots.snap | Updates expected snapshot outputs for new tags. |
| benches/classification.rs | Adds Criterion benchmark suite for classification workloads. |
| Cargo.toml | Adds benchmark target and new dependency (lazy_static); bumps some versions. |
| docs/src/classification.md | Rewrites classification documentation for the new pipeline and tags. |
| justfile | Updates lint/docs tooling invocation and simplifies GoReleaser tasks. |
| mise.toml | Expands toolchain tools/versions and adds more build/documentation tooling. |
| .goreleaser.yaml | Adds GoReleaser config for cross-platform Rust builds and hooks for tool setup. |

Comment on lines +103 to +106
#[must_use]
pub fn classify(&self, text: &str, context: &StringContext) -> Vec<Tag> {
let mut tags = Vec::new();

Copilot AI Jan 25, 2026

SemanticClassifier::classify has been changed to require (&str, &StringContext), which is a breaking change for any downstream code that previously called classify(&FoundString). If this is intended to be backward-compatible, consider restoring the old classify(&FoundString) signature as a thin wrapper (and optionally deprecate it) while introducing a new name for the context-aware API (e.g., classify_with_context).

Comment on lines 40 to 62
use lazy_static::lazy_static;
use regex::Regex;

use crate::types::{BinaryFormat, SectionType, StringContext, StringSource, Tag};

pub mod patterns;
pub mod ranking;
pub mod semantic;
pub mod symbols;

pub use ranking::{RankingConfig, RankingEngine};
pub use semantic::SemanticClassifier;
pub use symbols::SymbolDemangler;

// Import pattern classification functions
use patterns::{
    classify_domain, classify_ip_addresses, classify_posix_path, classify_registry_path,
    classify_unc_path, classify_url, classify_windows_path,
};

lazy_static! {
    static ref GUID_REGEX: Regex = Regex::new(
        r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$",
    )
    .expect("Invalid GUID regex");
Copilot AI Jan 25, 2026

This module introduces lazy_static for regex caching, but the rest of the codebase uses once_cell::sync::Lazy for the same purpose (e.g., src/classification/patterns/network.rs:6-22, src/classification/patterns/paths.rs:6-24). Consider switching these regex statics to once_cell::sync::Lazy and dropping the extra lazy_static dependency to keep conventions and dependencies consistent.

Comment on lines 11 to 41
New text:

## Semantic Categories

### GUIDs

- Pattern: `\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}`
- Examples: `{12345678-1234-1234-1234-123456789abc}`
- Validation: Strict format compliance with braces required

### Email Addresses

- Pattern: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- Examples: `admin@malware.com`, `user.name+tag@example.co.uk`
- Validation: Single `@`, valid TLD length and characters, no empty parts

### Base64 Data

- Pattern: `[A-Za-z0-9+/]{20,}={0,2}`
- Examples: `U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw==`
- Validation: Length >= 20, length divisible by 4, padding rules, entropy threshold

### Format Strings

- Pattern: `%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}`
- Examples: `Error: %s at line %d`, `User {0} logged in`
- Validation: Reasonable specifier count, context-aware thresholds

### User Agents

- Pattern: `Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+`
- Examples: `Mozilla/5.0 (Windows NT 10.0; Win64; x64)`, `Chrome/117.0.5938.92`
- Validation: Known browser identifiers and minimum length

## Tag Specificity

Tags are treated as either specific or broad. Specific tags indicate high confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag and should be treated as ambiguous due to higher false positive risk.

Removed text:

### Network Indicators

#### URLs

- **Pattern**: `` https?://[^\s<>"{}|\\^\[\]\`]+ ``
- **Examples**: `https://api.example.com/v1/users`, `http://malware.com/payload`
- **Validation**: URL format check with safe character filtering
- **Security relevance**: High - indicates network communication

#### Domain Names

- **Pattern**: `\b(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}\b`
- **Examples**: `api.example.com`, `malware-c2.net`
- **Validation**: TLD checking, DNS format compliance
- **Security relevance**: High - C2 domains, legitimate services

#### IP Addresses

- **IPv4 Pattern**: `\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b`
- **IPv6 Pattern**: Comprehensive pattern supporting full notation, compressed notation (`::1`), and mixed notation (`::ffff:192.0.2.1`)
- **Examples**: `192.168.1.1`, `2001:db8::1`, `[::1]:8080`
- **Validation**: Two-stage validation using regex pre-filter followed by `std::net::IpAddr` parsing for correctness
- **Port Handling**: IP addresses with ports (e.g., `192.168.1.1:8080`) are supported by automatically stripping the port suffix before validation
- **IPv6 Bracket Handling**: Bracketed IPv6 addresses (e.g., `[::1]` and `[::1]:8080`) are supported
- **False Positive Mitigation**: Version numbers like `1.2.3.4` are accepted as IPv4 addresses by design
- **Implementation**: See `src/classification/semantic.rs` for the complete implementation
- **Security relevance**: High - infrastructure indicators

### File System Indicators

#### File Paths

- **POSIX Pattern**: `^/[^\0\n\r]*`
- **Windows Pattern**: `^[A-Za-z]:\\[^\0\n\r]*`
- **UNC Pattern**: `^\\\\[a-zA-Z0-9.-]+\\[^\0\n\r]*`
- **Examples**: `/usr/bin/malware`, `C:\\Windows\\System32\\evil.dll`, `\\\\server\\share\\file.txt`
- **Validation rules**: Rejects null bytes, newlines, carriage returns; rejects consecutive path separators in POSIX paths (`//`) and consecutive backslashes in Windows paths (for example, `folder\\\\file.txt`), while allowing UNC paths that start with `\\\\`; applies a reasonable length limit (4096 max, stricter for unknown prefixes); POSIX paths must be absolute (start with `/`); Windows paths must use backslashes and a valid drive letter
- **Suspicious path examples**: `/etc/cron.d/`, `/etc/init.d/`, `/usr/local/bin/`, `/tmp/`, `/var/tmp/`; `C:\\Windows\\System32\\`, `C:\\Windows\\Temp\\`, `...\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Startup\\`
- **Security relevance**: Medium-High - persistence and execution locations

#### Registry Paths

- **Full root pattern**: `^HKEY_[A-Z_]+\\[^\0\n\r]*`
- **Abbreviated root pattern**: `^HK(LM|CU|CR|U|CC)\\[^\0\n\r]*`
- **Supported root keys**:
  - `HKEY_LOCAL_MACHINE`
  - `HKEY_CURRENT_USER`
  - `HKEY_CLASSES_ROOT`
  - `HKEY_USERS`
  - `HKEY_CURRENT_CONFIG`
- **Supported abbreviations**:
  - `HKLM`, `HKCU`, `HKCR`, `HKU`, `HKCC`
- **Suspicious registry paths**:
  - `\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run`
  - `\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\RunOnce`
  - `\\System\\CurrentControlSet\\Services`
  - `\\SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\Winlogon`
- **Examples**:
  - `HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run`
  - `HKCU\\Software\\Microsoft`
- **Security relevance**: High - persistence mechanisms

### Identifiers

#### GUIDs/UUIDs

- **Pattern**: `\{?[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}?`
- **Examples**: `{12345678-1234-1234-1234-123456789abc}`, `12345678-1234-1234-1234-123456789abc`
- **Validation**: Format compliance
- **Security relevance**: Medium - component identification

#### Email Addresses

- **Pattern**: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- **Examples**: `admin@malware.com`, `support@legitimate.org`
- **Validation**: Basic format validation
- **Security relevance**: Medium - contact information

### Code Artifacts

#### Format Strings

- **Pattern**: `%[-+0 #]*(\d+|\*)?(\.(\d+|\*))?(hh?|ll?|[Lzjt])?[diouxXeEfFgGaAcspn%]`
- **Examples**: `Error: %s at line %d`, `Name: %s, Age: %d, Score: %.2f`
- **Context**: Presence of real format specifiers (%% alone is ignored)
- **Security relevance**: Low-Medium - debugging information

#### Base64 Data

- **Pattern**: Character set validation with padding rules
- **Examples**: `SGVsbG8gV29ybGQ=`
- **Validation**: Length >= 16, Base64 character set, valid padding, reject length mod 4 of 1
- **Security relevance**: Variable - encoded payloads

#### User Agents

- **Pattern**: Prefix match for common agents (Mozilla, curl, Wget, python-requests, libwww-perl, Java, Apache-HttpClient, okhttp, PostmanRuntime)
- **Examples**: `Mozilla/5.0 (Windows NT 10.0; Win64; x64)`, `curl/7.68.0`
- **Security relevance**: Medium - network fingerprinting
Copilot AI Jan 25, 2026

The rewritten documentation lists only GUID/email/base64/format-string/user-agent categories, but the classifier still supports URL/domain/IP/file-path/registry-path tagging (see src/classification/mod.rs capability list and implementation). Please update this doc to include the still-supported tag categories so it matches the current behavior.

Comment on lines 11 to 17
before:
  hooks:
    # if you don't do these things before calling goreleaser, it might be a
    # good idea to do them here:
    - rustup default stable
    - cargo install --locked cargo-zigbuild
    - cargo fetch --locked
Copilot AI Jan 25, 2026

The before.hooks steps have reproducibility/side-effect concerns: rustup default stable mutates the global toolchain (and the repo already pins a toolchain via rust-toolchain.toml), and cargo install --locked cargo-zigbuild does not pin a cargo-zigbuild version. Consider removing the rustup default step and pinning cargo-zigbuild with --version (or relying on a preinstalled/pinned tool) to make release builds deterministic.
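
A hedged sketch of the pinned form (the version string is a placeholder, not a tested recommendation):

```yaml
before:
  hooks:
    # Pin the cross-compiler so release builds are deterministic.
    # Replace X.Y.Z with a version you have actually validated.
    - cargo install --locked cargo-zigbuild --version X.Y.Z
    - cargo fetch --locked
```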

Comment on lines 20 to 23
git-cliff = "2.11.0"
goreleaser = "2.13.3"
just = "latest"
markdownlint-cli2 = "0.20.0"
Copilot AI Jan 25, 2026

just = "latest" reduces toolchain reproducibility compared to the previous pinned version, which can make CI/local behavior drift over time. If reproducible builds are a goal, consider pinning just to a specific version here (similar to other tools).

Comment on lines 22 to 30
 [dependencies]
 clap = { version = "4.5.54", features = [ "derive" ] }
 cpp_demangle = "0.5.1"
-entropy = "0.4.2"
+entropy = "0.4.3"
 goblin = "0.10.4"
+lazy_static = "1.5.0"
 once_cell = "1.21.3"
 pelite = "0.10.0"
 regex = "1.12.2"
Copilot AI Jan 25, 2026

lazy_static was added as a new dependency, but existing regex statics in this repo use once_cell::sync::Lazy. If you switch the new classifier regex statics to once_cell, you can likely remove lazy_static here to avoid an extra dependency and keep the project consistent.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/classification/patterns/network.rs (1)

114-141: Add test coverage for email exclusion from domain classification.

The existing tests cover URL exclusion but not the new email exclusion behavior.

🧪 Suggested test addition
 #[test]
 fn test_domain_valid_and_invalid() {
     assert!(classify_domain("example.com").is_some());
     assert!(classify_domain("https://example.com").is_none());
     assert!(classify_domain("invalid.xyz123").is_none());
 }

+#[test]
+fn test_email_not_tagged_as_domain() {
+    // Email addresses should not be tagged as domains
+    assert!(classify_domain("user@example.com").is_none());
+    assert!(classify_domain("admin@subdomain.example.org").is_none());
+}
src/lib.rs (1)

69-73: Add crate-level safety attributes to the library root.

The #![forbid(unsafe_code)] and #![deny(warnings)] attributes are missing from src/lib.rs. These must be present at the crate root before any module declarations. Add them immediately after the crate documentation block and before the pub mod statements:

#![forbid(unsafe_code)]
#![deny(warnings)]

pub mod classification;
tests/classification_integration.rs (1)

50-72: Avoid timing-based assertion in unit tests.
Line 72 enforces a 500ms budget, which is CI and load sensitive and can cause flaky failures. Gate this behind an env flag or move it to benchmarks.

Proposed gating to avoid flaky CI runs
 #[test]
 fn test_classification_performance() {
+    if std::env::var("STRINGY_PERF_TESTS").is_err() {
+        return;
+    }
     let classifier = SemanticClassifier::new();
     ...
     assert!(elapsed < Duration::from_millis(500));
 }
As per coding guidelines, avoid timing-dependent assertions in tests.
🤖 Fix all issues with AI agents
In `@.goreleaser.yaml`:
- Around line 30-38: The rust builder block (builder: rust with id: cross) is
missing the required command for cargo-zigbuild; update that block to add
"command: zigbuild" so GoReleaser uses cargo-zigbuild instead of the default
cargo build when cross-compiling (ensuring targets like x86_64-unknown-linux-gnu
and aarch64-unknown-linux-gnu are built correctly).

In `@Cargo.toml`:
- Around line 27-28: Inconsistency: some pattern modules use
once_cell::sync::Lazy while src/classification/mod.rs uses lazy_static!; convert
those pattern modules to use lazy_static! for uniform lazy initialization. For
each pattern module that currently declares statics like "static NAME:
Lazy<Type> = Lazy::new(|| ...)", replace the import once_cell::sync::Lazy with
"use lazy_static::lazy_static;" and change the declaration to "lazy_static! {
static ref NAME: Type = ...; }", preserving the initializer body and ensuring
types match (e.g., HashMap, Regex); remove once_cell imports and usages, keep
the existing lazy_static dependency in Cargo.toml, and mirror the same pattern
used in src/classification/mod.rs so all modules use lazy_static! consistently.

In `@docs/src/classification.md`:
- Around line 96-97: Update the documentation reference in
docs/src/classification.md: replace the incorrect integration test path
`tests/classification_integration_tests.rs` with the actual test file name
`tests/classification_integration.rs` so the docs point to the correct
integration test.

In `@mise.toml`:
- Line 10: The file defines mdbook twice; remove the duplicate backend entry so
only one key remains (either "cargo:mdbook" or "mdbook"). Locate the two keys
"cargo:mdbook" and "mdbook" in the mise.toml and delete the one you do not want
to use (prefer keeping "cargo:mdbook" if you need the cargo backend), ensuring
the remaining key retains its version string.
- Around line 18-22: The manifest uses "latest" for the claude and just entries
which breaks reproducible builds; update the claude and just keys to fixed
semantic versions (e.g., replace claude = "latest" and just = "latest" with
explicit version strings) so builds are deterministic, ensure the chosen
versions are compatible with the other listed tools (cyclonedx, git-cliff,
goreleaser), and run any dependency lock/update step your repo uses after
changing the versions to commit the resolved state.
- Line 9: The manifest pins a pre-release "cargo:cargo-nextest" = "0.9.123-b.4";
update it to the stable release by replacing the value with "0.9.123" (i.e.,
change the "cargo:cargo-nextest" entry to use the stable 0.9.123 version) to
avoid using an undocumented pre-release.

In `@src/classification/mod.rs`:
- Around line 75-86: Mark both public structs with #[non_exhaustive] to follow
the guideline: add #[non_exhaustive] above SemanticClassifier and
RegexCacheAddresses. For RegexCacheAddresses, make the fields private (remove
pub on guid, email, base64, format, user_agent) and add a public constructor pub
fn new(guid: usize, email: usize, base64: usize, format: usize, user_agent:
usize) -> Self that returns Self { guid, email, base64, format, user_agent },
keeping the existing derives; update any call sites to use
RegexCacheAddresses::new instead of struct literal field initialization.

In `@src/types.rs`:
- Around line 295-308: Add stability guarantees to the public API type
StringContext by marking the struct with #[non_exhaustive] and adding an
explicit constructor similar to ContainerInfo; specifically, annotate the
StringContext definition with #[non_exhaustive] and implement impl StringContext
{ pub fn new(section_type: SectionType, section_name: Option<String>,
binary_format: BinaryFormat, encoding: Encoding, source: StringSource) -> Self {
... } } so callers must use the constructor and future fields can be added
without breaking changes.

In `@tests/classification_integration_tests.rs`:
- Around line 15-27: The test helper create_test_context constructs
StringContext via a struct literal which will break if StringContext becomes
#[non_exhaustive]; change create_test_context to call the provided constructor
(e.g. StringContext::new) instead of using the literal, passing section_type,
Some(".rodata".to_string()) (or the constructor's section_name param),
binary_format, Encoding::Ascii, and source so the helper remains compatible with
future additions to StringContext.
- Around line 153-179: The performance test test_classification_performance
currently asserts elapsed < Duration::from_secs(2) for ~3600 samples which is
inconsistent with the other integration test's 500ms budget; update this test to
use the same timeout strategy (e.g., change the assertion to use
Duration::from_millis(500) or adjust the samples count to match that budget) or
add a brief inline comment near test_classification_performance explaining why
this test uses a longer 2s timeout for CI stability; ensure you touch the
Duration::from_secs(2) assertion and/or the samples construction and keep
references to SemanticClassifier::new(), classifier.classify(...), and
regex_cache_addresses() intact.

Cargo.toml Outdated
Comment on lines 27 to 28
lazy_static = "1.5.0"
once_cell = "1.21.3"
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check which lazy initialization pattern is used where
echo "=== lazy_static! usage ==="
rg -n 'lazy_static!' --type rust

echo ""
echo "=== once_cell::sync::Lazy usage ==="
rg -n 'once_cell::sync::Lazy' --type rust

echo ""
echo "=== once_cell re-exports (lazy_static or Lazy) in module level ==="
rg -n 'once_cell' --type rust | head -20

echo ""
echo "=== Cargo.toml dependencies around lines 25-35 ==="
sed -n '25,35p' Cargo.toml

Repository: EvilBit-Labs/Stringy

Length of output: 1115


Inconsistent lazy initialization patterns.

The codebase mixes two lazy initialization approaches: lazy_static! is used in src/classification/mod.rs, while once_cell::sync::Lazy is used across four pattern modules. Standardize on one approach for consistency.

Given that lazy_static is listed as a core dependency in the coding guidelines and is actively used, keep it. Migrate pattern modules from once_cell::sync::Lazy to lazy_static! to maintain uniform lazy initialization patterns.

🤖 Prompt for AI Agents
In `@Cargo.toml` around lines 27 - 28, Inconsistency: some pattern modules use
once_cell::sync::Lazy while src/classification/mod.rs uses lazy_static!; convert
those pattern modules to use lazy_static! for uniform lazy initialization. For
each pattern module that currently declares statics like "static NAME:
Lazy<Type> = Lazy::new(|| ...)", replace the import once_cell::sync::Lazy with
"use lazy_static::lazy_static;" and change the declaration to "lazy_static! {
static ref NAME: Type = ...; }", preserving the initializer body and ensuring
types match (e.g., HashMap, Regex); remove once_cell imports and usages, keep
the existing lazy_static dependency in Cargo.toml, and mirror the same pattern
used in src/classification/mod.rs so all modules use lazy_static! consistently.

Comment on lines +96 to +97
- Unit tests: `tests/classification_tests.rs`
- Integration tests: `tests/classification_integration_tests.rs`
Contributor

⚠️ Potential issue | 🟡 Minor

Fix integration test path in docs.
Line 97 references tests/classification_integration_tests.rs, but the test file in this PR is tests/classification_integration.rs.

Proposed doc fix
-- Integration tests: `tests/classification_integration_tests.rs`
+- Integration tests: `tests/classification_integration.rs`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- Unit tests: `tests/classification_tests.rs`
- Integration tests: `tests/classification_integration_tests.rs`
- Unit tests: `tests/classification_tests.rs`
- Integration tests: `tests/classification_integration.rs`
🤖 Prompt for AI Agents
In `@docs/src/classification.md` around lines 96 - 97, Update the documentation
reference in docs/src/classification.md: replace the incorrect integration test
path `tests/classification_integration_tests.rs` with the actual test file name
`tests/classification_integration.rs` so the docs point to the correct
integration test.

"cargo:cargo-deny" = "0.19.0"
"cargo:cargo-dist" = "0.30.3"
"cargo:cargo-llvm-cov" = "0.6.24"
"cargo:cargo-nextest" = "0.9.123-b.4"
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the mise.toml file
cat -n mise.toml | head -20

Repository: EvilBit-Labs/Stringy

Length of output: 952


🏁 Script executed:

# Search for cargo-nextest version information in the codebase
rg "cargo-nextest" -A 2 -B 2

Repository: EvilBit-Labs/Stringy

Length of output: 1101


🌐 Web query:

cargo-nextest 0.9.123 pre-release version

💡 Result:

cargo-nextest 0.9.123 had beta (pre-release) builds:

  • 0.9.123-b.1 (2026-01-18): introduces experimental record & replay test runs. [1]
  • 0.9.123-b.2 (2026-01-18): improvements to record-replay (storage + display). [1]
  • 0.9.123-b.3 (2026-01-21): adds rerun functionality on top of record-replay. [1]

If you want the pre-release binaries, use the “release URL” scheme and put the exact pre-release in {version} (for example 0.9.123-b.3):
https://get.nexte.st/{version}/{platform} [2]

Example (Linux x86_64):

curl -LsSf https://get.nexte.st/0.9.123-b.3/linux | tar zxf - -C ${CARGO_HOME:-~/.cargo}/bin

Note: the stable cargo-nextest 0.9.123 release was published on 2026-01-23. [3]

Sources: [1] [2] [3]


Use stable cargo-nextest 0.9.123 instead of pre-release

Line 9 pins a pre-release version (0.9.123-b.4). The stable 0.9.123 release is available (2026-01-23). For a binary analysis tool, prefer the stable release to avoid unnecessary instability. Additionally, only beta versions b.1–b.3 are documented; verify that 0.9.123-b.4 actually exists.

🤖 Prompt for AI Agents
In `@mise.toml` at line 9, The manifest pins a pre-release "cargo:cargo-nextest" =
"0.9.123-b.4"; update it to the stable release by replacing the value with
"0.9.123" (i.e., change the "cargo:cargo-nextest" entry to use the stable
0.9.123 version) to avoid using an undocumented pre-release.

"cargo:cargo-dist" = "0.30.3"
"cargo:cargo-llvm-cov" = "0.6.24"
"cargo:cargo-nextest" = "0.9.123-b.4"
"cargo:mdbook" = "0.5.2"
Contributor

🧹 Nitpick | 🔵 Trivial

Avoid duplicate mdbook definitions

Line 10 and Line 24 define mdbook twice via different backends. This can lead to ambiguous tool resolution or double installs. Prefer a single source of truth (either cargo:mdbook or mdbook) and remove the other.

Example (keep cargo backend)
-mdbook                      = "0.5.2"

Also applies to: 24-24

🤖 Prompt for AI Agents
In `@mise.toml` at line 10, The file defines mdbook twice; remove the duplicate
backend entry so only one key remains (either "cargo:mdbook" or "mdbook").
Locate the two keys "cargo:mdbook" and "mdbook" in the mise.toml and delete the
one you do not want to use (prefer keeping "cargo:mdbook" if you need the cargo
backend), ensuring the remaining key retains its version string.

mise.toml Outdated
Comment on lines 18 to 22
claude = "latest"
cyclonedx = "0.29.2"
git-cliff = "2.11.0"
goreleaser = "2.13.3"
just = "latest"
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find and read the mise.toml file
find . -name "mise.toml" -type f

Repository: EvilBit-Labs/Stringy

Length of output: 75


🏁 Script executed:

# Once located, read the file with line numbers
cat -n mise.toml | head -30

Repository: EvilBit-Labs/Stringy

Length of output: 1410


🏁 Script executed:

# Check for any Unicode punctuation in mise.toml
file mise.toml
hexdump -C mise.toml | grep -E "e2|e3|e4|e8|e9|ea|eb|ef" | head -20

Repository: EvilBit-Labs/Stringy

Length of output: 152


Pin explicit versions for claude and just to maintain deterministic builds

Lines 18 and 22 use "latest" for claude and just, which breaks build reproducibility across environments and time. Pin specific versions.

Example fix
-claude                      = "latest"
+claude                      = "X.Y.Z"
...
-just                        = "latest"
+just                        = "X.Y.Z"
🤖 Prompt for AI Agents
In `@mise.toml` around lines 18 - 22, The manifest uses "latest" for the claude
and just entries which breaks reproducible builds; update the claude and just
keys to fixed semantic versions (e.g., replace claude = "latest" and just =
"latest" with explicit version strings) so builds are deterministic, ensure the
chosen versions are compatible with the other listed tools (cyclonedx,
git-cliff, goreleaser), and run any dependency lock/update step your repo uses
after changing the versions to commit the resolved state.

Comment on lines 75 to 86
#[derive(Debug, Default)]
pub struct SemanticClassifier;

#[doc(hidden)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RegexCacheAddresses {
    pub guid: usize,
    pub email: usize,
    pub base64: usize,
    pub format: usize,
    pub user_agent: usize,
}
Contributor

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f "mod.rs" "src/classification"

Repository: EvilBit-Labs/Stringy

Length of output: 124


🏁 Script executed:

cat -n src/classification/mod.rs

Repository: EvilBit-Labs/Stringy

Length of output: 14179


Add #[non_exhaustive] to public structs.
Both SemanticClassifier and RegexCacheAddresses are public and must be marked #[non_exhaustive] per coding guidelines. Additionally, RegexCacheAddresses should provide an explicit constructor instead of exposing fields directly.

Proposed change
+#[non_exhaustive]
 #[derive(Debug, Default)]
 pub struct SemanticClassifier;

 #[doc(hidden)]
+#[non_exhaustive]
 #[derive(Debug, Clone, Copy, PartialEq, Eq)]
 pub struct RegexCacheAddresses {
🤖 Prompt for AI Agents
In `@src/classification/mod.rs` around lines 75 - 86, Mark both public structs
with #[non_exhaustive] to follow the guideline: add #[non_exhaustive] above
SemanticClassifier and RegexCacheAddresses. For RegexCacheAddresses, make the
fields private (remove pub on guid, email, base64, format, user_agent) and add a
public constructor pub fn new(guid: usize, email: usize, base64: usize, format:
usize, user_agent: usize) -> Self that returns Self { guid, email, base64,
format, user_agent }, keeping the existing derives; update any call sites to use
RegexCacheAddresses::new instead of struct literal field initialization.

…e deprecated semantic module

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@unclesp1d3r unclesp1d3r requested a review from Copilot January 25, 2026 18:14
@unclesp1d3r unclesp1d3r changed the title fix: implement semantic pattern matching guid email base64 printf format and user agent detection feat(classification): add semantic pattern matching for GUID, email, base64, format strings, and user agents Jan 25, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@benches/classification.rs`:
- Around line 16-22: The benchmark named bench_regex_compilation currently only
calls SemanticClassifier::new, which doesn't force lazy_static regex
compilation; either rename the benchmark to reflect it's measuring the
constructor (e.g., classification_constructor) or change its body to trigger
regex initialization explicitly (call a method that uses the regex such as
SemanticClassifier::classify or call the module's regex initializer so
lazy_static/once_cell runs) so the measured work matches the benchmark name;
apply the same change to the duplicate at lines 124-135 as well.
♻️ Duplicate comments (2)
src/classification/mod.rs (1)

75-86: Add #[non_exhaustive] and a constructor for public structs.

Both SemanticClassifier and RegexCacheAddresses are public and should be non_exhaustive. RegexCacheAddresses should also expose a constructor so fields are not the public API surface. This is required for API stability. As per coding guidelines.

Proposed fix
+#[non_exhaustive]
 #[derive(Debug, Default)]
 pub struct SemanticClassifier;

 #[doc(hidden)]
+#[non_exhaustive]
 #[derive(Debug, Clone, Copy, PartialEq, Eq)]
 pub struct RegexCacheAddresses {
-    pub guid: usize,
-    pub email: usize,
-    pub base64: usize,
-    pub format: usize,
-    pub user_agent: usize,
+    guid: usize,
+    email: usize,
+    base64: usize,
+    format: usize,
+    user_agent: usize,
 }
+
+impl RegexCacheAddresses {
+    #[must_use]
+    pub fn new(guid: usize, email: usize, base64: usize, format: usize, user_agent: usize) -> Self {
+        Self {
+            guid,
+            email,
+            base64,
+            format,
+            user_agent,
+        }
+    }
+}
     pub fn regex_cache_addresses(&self) -> RegexCacheAddresses {
-        RegexCacheAddresses {
-            guid: &*GUID_REGEX as *const Regex as usize,
-            email: &*EMAIL_REGEX as *const Regex as usize,
-            base64: &*BASE64_REGEX as *const Regex as usize,
-            format: &*FORMAT_REGEX as *const Regex as usize,
-            user_agent: &*USER_AGENT_REGEX as *const Regex as usize,
-        }
+        RegexCacheAddresses::new(
+            &*GUID_REGEX as *const Regex as usize,
+            &*EMAIL_REGEX as *const Regex as usize,
+            &*BASE64_REGEX as *const Regex as usize,
+            &*FORMAT_REGEX as *const Regex as usize,
+            &*USER_AGENT_REGEX as *const Regex as usize,
+        )
     }

Also applies to: 178-187

tests/classification_integration_tests.rs (1)

148-174: Avoid hard wall clock assertions in tests.

This can be flaky across CI runners. Prefer moving the performance check to Criterion or mark it ignored for CI. As per coding guidelines.

Proposed fix
-#[test]
-fn test_classification_performance() {
+#[test]
+#[ignore]
+fn test_classification_performance() {

Comment on lines 16 to 22
fn bench_regex_compilation(c: &mut Criterion) {
    c.bench_function("classification_regex_compilation", |b| {
        b.iter(|| {
            let _ = SemanticClassifier::new();
        });
    });
}
Contributor

🧹 Nitpick | 🔵 Trivial

Rename or rework the regex compilation benchmark.

SemanticClassifier::new does not trigger lazy_static regex compilation, so this benchmark measures a trivial constructor, not regex compilation. Rename it or trigger regex use explicitly so the measurement matches its name.

Proposed fix
-fn bench_regex_compilation(c: &mut Criterion) {
-    c.bench_function("classification_regex_compilation", |b| {
+fn bench_classifier_init(c: &mut Criterion) {
+    c.bench_function("classification_classifier_init", |b| {
         b.iter(|| {
             let _ = SemanticClassifier::new();
         });
     });
 }
 criterion_group!(
     classification_benches,
-    bench_regex_compilation,
+    bench_classifier_init,
     bench_guid_classification,
     bench_email_classification,
     bench_base64_classification,
     bench_format_string_classification,
     bench_user_agent_classification,
     bench_batch_classification,
     bench_worst_case,
     bench_context_creation
 );

Also applies to: 124-135

🤖 Prompt for AI Agents
In `@benches/classification.rs` around lines 16 - 22, The benchmark named
bench_regex_compilation currently only calls SemanticClassifier::new, which
doesn't force lazy_static regex compilation; either rename the benchmark to
reflect it's measuring the constructor (e.g., classification_constructor) or
change its body to trigger regex initialization explicitly (call a method that
uses the regex such as SemanticClassifier::classify or call the module's regex
initializer so lazy_static/once_cell runs) so the measured work matches the
benchmark name; apply the same change to the duplicate at lines 124-135 as well.

Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Comment on lines 65 to 69
 let classifier = SemanticClassifier::new();
-let found_string = FoundString {
-    text: "C:\\Windows\\System32\\cmd.exe".to_string(),
+let context = StringContext {
+    section_type: SectionType::StringData,
+    section_name: Some(".rodata".to_string()),
+    binary_format: BinaryFormat::Elf,
Copilot AI Jan 25, 2026

This example constructs StringContext with a struct literal, but StringContext is #[non_exhaustive], so external users cannot instantiate it this way. Update the example to use StringContext::new(...) plus builder methods (e.g., with_section_name) so the docs compile and match the intended construction pattern.

Comment on lines +45 to 47
 pub mod patterns;
 pub mod ranking;
-pub mod semantic;
 pub mod symbols;
Copilot AI Jan 25, 2026

pub mod semantic was removed, which breaks users importing from stringy::classification::semantic::* (this module was previously public). If you want to preserve source compatibility, consider reintroducing a small semantic module that re-exports SemanticClassifier (and any still-supported helpers), and optionally deprecate it.

…ing and regex caching

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/classification/patterns/network.rs (1)

27-57: Domain detection misses newer TLDs due to incomplete allowlist.

has_valid_tld only accepts domains in COMMON_TLDS, causing false negatives for valid but unlisted TLDs (e.g., .systems, .network, .photography). For a binary analysis tool, this reduces detection coverage for command-and-control infrastructure and other malicious domains. Consider a fallback heuristic: if a TLD is 2-24 characters and all alphabetic, treat it as valid rather than rejecting it outright.

Suggested fallback
 if let Some(dot_pos) = domain.rfind('.') {
     let tld = &domain[dot_pos + 1..];
     let tld_lower = tld.to_lowercase();
-    COMMON_TLDS.contains(tld_lower.as_str())
+    if COMMON_TLDS.contains(tld_lower.as_str()) {
+        true
+    } else {
+        (2..=24).contains(&tld.len()) && tld.chars().all(|c| c.is_ascii_alphabetic())
+    }
 } else {
     false
 }
🤖 Fix all issues with AI agents
In `@src/types/error.rs`:
- Around line 12-44: Replace the flat String-based ParseError variants and
manual From impls by adding dedicated error-typed variants on the StringyError
enum and deriving #[from] where appropriate so original error sources are
preserved: change ParseError(String) to ParseError(#[from] goblin::error::Error)
and add variants like PeliteError(#[from] pelite::Error) and
ResourceFindError(#[from] pelite::resources::FindError) (or similarly-named
variants) and remove the manual impl From<...> blocks for goblin and pelite;
keep other variants (EncodingError, ConfigError, MemoryMapError) as-is and keep
the Result<T> = std::result::Result<T, StringyError> alias.

In `@tests/classification_integration_tests.rs`:
- Around line 24-105: Update the three tests test_elf_string_classification,
test_pe_string_classification, and test_macho_string_classification to record
insta snapshots of the classifier results instead of only asserting individual
tags: after creating SemanticClassifier::new(), creating the context via
create_test_context(...), and calling classifier.classify(...), collect the
returned tag sets (for each input string like guid, email, format_string,
base64, user_agent) into a serializable structure (e.g., Vec or HashSet of Tag
names) and replace or augment the assert!(tags.contains(...)) checks with insta
snapshot assertions (e.g., insta::assert_yaml_snapshot!) to persist the expected
classification outputs for each fixture-driven test; keep at least one explicit
assert per test if desired but ensure snapshotting uses the classifier variable
and context used in each test (symbols: SemanticClassifier::new,
create_test_context, classifier.classify, and the test functions named above).

In `@tests/classification_tests.rs`:
- Around line 177-226: Extract repeated FoundString construction into a small
helper (e.g., make_found_string or build_found_string) used by
test_classify_found_string_backward_compatibility: the helper should accept
parameters used in the diffs (string content, Encoding, offset, length,
StringSource, optional section) and return a FoundString (created via
FoundString::new and .with_section when needed); replace the three inline
FoundString::new(...) constructions in the test with calls to this helper and
keep calls to classifier.classify_found_string(&found_...) unchanged to preserve
assertions for Tag::Guid, Tag::Email, and Tag::FormatString.
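
A sketch of the helper that prompt describes. FoundString::new's exact arity and the with_section name are taken from the prompt text, so treat the signature as an assumption:

```rust
// Hypothetical helper; adjust to FoundString::new's real signature.
fn make_found_string(
    text: &str,
    encoding: Encoding,
    offset: u64,
    source: StringSource,
    section: Option<&str>,
) -> FoundString {
    let found = FoundString::new(text.to_string(), encoding, offset, text.len(), source);
    match section {
        Some(name) => found.with_section(name.to_string()),
        None => found,
    }
}
```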
♻️ Duplicate comments (2)
tests/classification_integration_tests.rs (1)

148-170: Avoid hard wall-clock thresholds in unit tests.

The `elapsed < Duration::from_secs(2)` assertion is prone to CI variance. This was already flagged in a prior review; consider moving it to benchmarks, marking the test as ignored, or loosening/removing the timing check.

src/classification/mod.rs (1)

75-77: Add #[non_exhaustive] to public SemanticClassifier.

Public structs should be #[non_exhaustive] for API stability. This was already flagged in prior review. As per coding guidelines.

Proposed change
-#[derive(Debug, Default)]
+#[non_exhaustive]
+#[derive(Debug, Default)]
 pub struct SemanticClassifier;

Comment on lines +12 to +44
#[error("Binary parsing error: {0}")]
ParseError(String),

#[error("Invalid encoding in string at offset {offset}")]
EncodingError { offset: u64 },

#[error("Configuration error: {0}")]
ConfigError(String),

#[error("Memory mapping error: {0}")]
MemoryMapError(String),
}

/// Result type alias for the stringy library
pub type Result<T> = std::result::Result<T, StringyError>;

impl From<goblin::error::Error> for StringyError {
fn from(err: goblin::error::Error) -> Self {
StringyError::ParseError(err.to_string())
}
}

impl From<pelite::Error> for StringyError {
fn from(err: pelite::Error) -> Self {
StringyError::ParseError(err.to_string())
}
}

impl From<pelite::resources::FindError> for StringyError {
fn from(err: pelite::resources::FindError) -> Self {
StringyError::ParseError(format!("Resource lookup error: {}", err))
}
}

🧹 Nitpick | 🔵 Trivial

Preserve parse error sources instead of flattening to String.

Converting goblin and pelite errors to String loses the error chain and makes debugging harder. Prefer dedicated variants with #[from] to preserve sources. Based on coding guidelines.

Proposed refactor to preserve sources
 pub enum StringyError {
@@
-    #[error("Binary parsing error: {0}")]
-    ParseError(String),
+    #[error("Goblin parse error: {0}")]
+    GoblinError(#[from] goblin::error::Error),
+
+    #[error("PE parse error: {0}")]
+    PeliteError(#[from] pelite::Error),
+
+    #[error("PE resource lookup error: {0}")]
+    PeliteResourceError(#[from] pelite::resources::FindError),
@@
-impl From<goblin::error::Error> for StringyError {
-    fn from(err: goblin::error::Error) -> Self {
-        StringyError::ParseError(err.to_string())
-    }
-}
-
-impl From<pelite::Error> for StringyError {
-    fn from(err: pelite::Error) -> Self {
-        StringyError::ParseError(err.to_string())
-    }
-}
-
-impl From<pelite::resources::FindError> for StringyError {
-    fn from(err: pelite::resources::FindError) -> Self {
-        StringyError::ParseError(format!("Resource lookup error: {}", err))
-    }
-}
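
For context, a minimal compiling sketch of the `#[from]` shape (variant names follow the suggestion above and are not yet the crate's actual API):

```rust
use thiserror::Error;

// Each upstream error gets a dedicated variant; #[from] derives the
// From impl and preserves the original error as source().
#[derive(Debug, Error)]
pub enum StringyError {
    #[error("Goblin parse error: {0}")]
    GoblinError(#[from] goblin::error::Error),

    #[error("PE parse error: {0}")]
    PeliteError(#[from] pelite::Error),
}

// `?` now converts automatically, with no manual From boilerplate and
// no stringification, so callers can still walk the error chain.
fn parse_elf(data: &[u8]) -> Result<goblin::elf::Elf<'_>, StringyError> {
    Ok(goblin::elf::Elf::parse(data)?)
}
```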
🤖 Prompt for AI Agents
In `@src/types/error.rs` around lines 12 - 44, Replace the flat String-based
ParseError variants and manual From impls by adding dedicated error-typed
variants on the StringyError enum and deriving #[from] where appropriate so
original error sources are preserved: change ParseError(String) to
ParseError(#[from] goblin::error::Error) and add variants like
PeliteError(#[from] pelite::Error) and ResourceFindError(#[from]
pelite::resources::FindError) (or similarly-named variants) and remove the
manual impl From<...> blocks for goblin and pelite; keep other variants
(EncodingError, ConfigError, MemoryMapError) as-is and keep the Result<T> =
std::result::Result<T, StringyError> alias.

Comment on lines +24 to +105
#[test]
fn test_elf_string_classification() {
    let fixture_path = get_fixture_path("test_binary_elf");
    let elf_data = fs::read(&fixture_path)
        .expect("Failed to read ELF fixture. Run the build script to generate fixtures.");

    assert!(ElfParser::detect(&elf_data), "ELF detection should succeed");
    let parser = ElfParser::new();
    let container_info = parser.parse(&elf_data).expect("Failed to parse ELF");

    assert_eq!(container_info.format, BinaryFormat::Elf);

    let classifier = SemanticClassifier::new();
    let context = create_test_context(
        BinaryFormat::Elf,
        SectionType::StringData,
        StringSource::SectionData,
    );

    let guid = "{12345678-1234-1234-1234-123456789abc}";
    let tags = classifier.classify(guid, &context);
    assert!(tags.contains(&Tag::Guid));

    let email = "admin@malware.com";
    let tags = classifier.classify(email, &context);
    assert!(tags.contains(&Tag::Email));

    let format_string = "Error: %s at line %d";
    let tags = classifier.classify(format_string, &context);
    assert!(tags.contains(&Tag::FormatString));
}

#[test]
fn test_pe_string_classification() {
    let fixture_path = get_fixture_path("test_binary_pe.exe");
    let pe_data = fs::read(&fixture_path)
        .expect("Failed to read PE fixture. Run the build script to generate fixtures.");

    assert!(PeParser::detect(&pe_data), "PE detection should succeed");
    let parser = PeParser::new();
    let container_info = parser.parse(&pe_data).expect("Failed to parse PE");

    assert_eq!(container_info.format, BinaryFormat::Pe);

    let classifier = SemanticClassifier::new();
    let context = create_test_context(
        BinaryFormat::Pe,
        SectionType::Resources,
        StringSource::ResourceString,
    );

    let base64 = "U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw==";
    let tags = classifier.classify(base64, &context);
    assert!(tags.contains(&Tag::Base64));

    let user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36";
    let tags = classifier.classify(user_agent, &context);
    assert!(tags.contains(&Tag::UserAgent));
}

#[test]
fn test_macho_string_classification() {
    let fixture_path = get_fixture_path("test_binary_macho");
    let macho_data = fs::read(&fixture_path)
        .expect("Failed to read Mach-O fixture. Run the build script to generate fixtures.");

    assert!(
        MachoParser::detect(&macho_data),
        "Mach-O detection should succeed"
    );
    let parser = MachoParser::new();
    let container_info = parser.parse(&macho_data).expect("Failed to parse Mach-O");

    assert_eq!(container_info.format, BinaryFormat::MachO);

    let classifier = SemanticClassifier::new();
    let context = create_test_context(
        BinaryFormat::MachO,
        SectionType::StringData,
        StringSource::SectionData,
    );


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if the test file exists and view its structure
fd "classification_integration_tests.rs" tests/

# Count lines to understand file size
wc -l tests/classification_integration_tests.rs 2>/dev/null || echo "File not found"

# Check for any insta usage in the test file
rg "insta" tests/classification_integration_tests.rs 2>/dev/null || echo "No insta found"

# Look for existing snapshot files
fd "\.snap" tests/ 2>/dev/null | head -20 || echo "No snapshots found"

# Check the test structure more broadly
rg "#\[test\]" tests/classification_integration_tests.rs -A 2 2>/dev/null || echo "No test macros found"

Repository: EvilBit-Labs/Stringy

Length of output: 2004


🏁 Script executed:

# Let me examine the test file to understand current structure
head -50 tests/classification_integration_tests.rs

# Check if there are any imports of insta in the file
rg "use insta" tests/classification_integration_tests.rs

# Check if the snapshot file is referenced or used
rg "classification_snapshots\|assert_debug_snapshot\|assert_snapshot" tests/classification_integration_tests.rs

# Look at the full test file to understand the structure
wc -l tests/classification_integration_tests.rs
cat tests/classification_integration_tests.rs

Repository: EvilBit-Labs/Stringy

Length of output: 7547


🏁 Script executed:

# Check the existing snapshot file to understand the pattern
cat tests/snapshots/classification_integration__classification_snapshots.snap

# Verify if other test files in the repository use insta snapshots
rg "assert_debug_snapshot\|assert_snapshot" tests/ -l | head -5

# Check one of the output integration tests to see the insta usage pattern
rg "assert_.*_snapshot" tests/output_json_integration_tests.rs -B 2 -A 2 | head -30

Repository: EvilBit-Labs/Stringy

Length of output: 733


🏁 Script executed:

# Check if there's a different classification_integration.rs file
fd "classification_integration\.rs" tests/

# Look for where the snapshot is actually used
rg "classification_snapshots" tests/

# Check git history or structure to understand the relationship
git log --oneline tests/classification_integration* 2>/dev/null | head -10

# Check if insta is even in the dependencies
rg "insta" Cargo.toml tests/Cargo.toml 2>/dev/null | head -20

# Look at the output_json test to see insta usage pattern
fd "output_json_integration" tests/ -type f

Repository: EvilBit-Labs/Stringy

Length of output: 491


🏁 Script executed:

# Look at the actual classification_integration.rs file that uses snapshots
cat tests/classification_integration.rs

# Compare file sizes
wc -l tests/classification_integration.rs tests/classification_integration_tests.rs

# Check imports in classification_integration.rs
head -20 tests/classification_integration.rs

Repository: EvilBit-Labs/Stringy

Length of output: 4011


Add insta snapshots for fixture-driven tests.

Tests loading binary fixtures must use insta snapshots per coding guidelines. The three ELF/PE/Mach-O format tests (lines 24-105) load fixtures from tests/fixtures/ but verify classifications only with assert!(tags.contains(...)). Snapshot the classified tag sets per format to capture expected behavior and prevent regressions.

Example refactor for ELF test
 use std::fs;
 use std::time::{Duration, Instant};
+use insta::assert_debug_snapshot;

 use stringy::classification::SemanticClassifier;
 use stringy::container::{ContainerParser, ElfParser, MachoParser, PeParser};
 use stringy::types::{BinaryFormat, Encoding, SectionType, StringContext, StringSource, Tag};
@@
     let guid = "{12345678-1234-1234-1234-123456789abc}";
     let tags = classifier.classify(guid, &context);
     assert!(tags.contains(&Tag::Guid));
+    assert_debug_snapshot!("elf_guid_tags", tags);
     
     let email = "admin@malware.com";
     let tags = classifier.classify(email, &context);
     assert!(tags.contains(&Tag::Email));
+    assert_debug_snapshot!("elf_email_tags", tags);
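
Roughly, a standalone test in that style might look like this (it reuses the file's existing `create_test_context` helper and imports):

```rust
use insta::assert_debug_snapshot;

#[test]
fn elf_guid_tags_snapshot() {
    let classifier = SemanticClassifier::new();
    let context = create_test_context(
        BinaryFormat::Elf,
        SectionType::StringData,
        StringSource::SectionData,
    );

    let tags = classifier.classify("{12345678-1234-1234-1234-123456789abc}", &context);

    // First run records the snapshot under tests/snapshots/ (accept it
    // with `cargo insta review`); later runs fail if the tag set drifts.
    assert_debug_snapshot!("elf_guid_tags", tags);
}
```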
🤖 Prompt for AI Agents
In `@tests/classification_integration_tests.rs` around lines 24 - 105, Update the
three tests test_elf_string_classification, test_pe_string_classification, and
test_macho_string_classification to record insta snapshots of the classifier
results instead of only asserting individual tags: after creating
SemanticClassifier::new(), creating the context via create_test_context(...),
and calling classifier.classify(...), collect the returned tag sets (for each
input string like guid, email, format_string, base64, user_agent) into a
serializable structure (e.g., Vec or HashSet of Tag names) and replace or
augment the assert!(tags.contains(...)) checks with insta snapshot assertions
(e.g., insta::assert_yaml_snapshot!) to persist the expected classification
outputs for each fixture-driven test; keep at least one explicit assert per test
if desired but ensure snapshotting uses the classifier variable and context used
in each test (symbols: SemanticClassifier::new, create_test_context,
classifier.classify, and the test functions named above).

Comment on lines +177 to +226
#[test]
fn test_classify_found_string_backward_compatibility() {
    let classifier = SemanticClassifier::new();

    // Test GUID classification via FoundString
    let found_guid = FoundString::new(
        "{12345678-1234-1234-1234-123456789abc}".to_string(),
        Encoding::Ascii,
        0,
        38,
        StringSource::SectionData,
    )
    .with_section(".rodata".to_string());

    let tags = classifier.classify_found_string(&found_guid);
    assert!(
        tags.contains(&Tag::Guid),
        "GUID should be detected via classify_found_string"
    );

    // Test email classification via FoundString
    let found_email = FoundString::new(
        "admin@example.com".to_string(),
        Encoding::Ascii,
        100,
        17,
        StringSource::SectionData,
    );

    let tags = classifier.classify_found_string(&found_email);
    assert!(
        tags.contains(&Tag::Email),
        "Email should be detected via classify_found_string"
    );

    // Test format string classification via FoundString
    let found_format = FoundString::new(
        "Error: %s at line %d".to_string(),
        Encoding::Ascii,
        200,
        20,
        StringSource::SectionData,
    );

    let tags = classifier.classify_found_string(&found_format);
    assert!(
        tags.contains(&Tag::FormatString),
        "Format string should be detected via classify_found_string"
    );
}

🧹 Nitpick | 🔵 Trivial

Factor repeated FoundString construction into a helper.

Multiple tests hand-build FoundString instances. A small helper reduces duplication and keeps length/offset usage consistent.

Example helper
 use stringy::types::{
     BinaryFormat, Encoding, FoundString, SectionType, StringContext, StringSource, Tag,
 };
@@
+fn make_found_string(text: &str, offset: u64, length: u32) -> FoundString {
+    FoundString::new(
+        text.to_string(),
+        Encoding::Ascii,
+        offset,
+        length,
+        StringSource::SectionData,
+    )
+}
@@
-    let found_guid = FoundString::new(
-        "{12345678-1234-1234-1234-123456789abc}".to_string(),
-        Encoding::Ascii,
-        0,
-        38,
-        StringSource::SectionData,
-    )
-    .with_section(".rodata".to_string());
+    let found_guid = make_found_string(
+        "{12345678-1234-1234-1234-123456789abc}",
+        0,
+        38,
+    )
+    .with_section(".rodata".to_string());
🤖 Prompt for AI Agents
In `@tests/classification_tests.rs` around lines 177 - 226, Extract repeated
FoundString construction into a small helper (e.g., make_found_string or
build_found_string) used by test_classify_found_string_backward_compatibility:
the helper should accept parameters used in the diffs (string content, Encoding,
offset, length, StringSource, optional section) and return a FoundString
(created via FoundString::new and .with_section when needed); replace the three
inline FoundString::new(...) constructions in the test with calls to this helper
and keep calls to classifier.classify_found_string(&found_...) unchanged to
preserve assertions for Tag::Guid, Tag::Email, and Tag::FormatString.

unclesp1d3r and others added 2 commits January 25, 2026 15:16
…terns, and file structure

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Add missing `command: zigbuild` for cross-compilation in goreleaser
- Remove `rustup default stable` hook (toolchain pinned via rust-toolchain.toml)
- Replace lazy_static with once_cell::sync::Lazy for consistency
- Remove lazy_static dependency from Cargo.toml
- Fix documentation to use StringContext::new() constructor pattern
- Add missing semantic categories (URLs, domains, IPs, paths) to docs
- Remove flaky wall-clock timing assertion from integration tests
- Rename misleading benchmark from regex_compilation to classifier_construction
- Remove duplicate mdbook definition and pin just version in mise.toml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@AGENTS.md`:
- Around line 88-91: Replace the bold label "**New semantic tag**:" with a
proper Markdown heading (e.g., "### New semantic tag") or a list item to satisfy
MD036; keep the rest of the sentence ("Add variant to `Tag` enum in
`types/mod.rs`, implement pattern in `classification/patterns/` or
`classification/mod.rs`") unchanged and ensure the heading level aligns with
surrounding headings in AGENTS.md so references to the Tag enum (`Tag`),
types/mod.rs, and classification/patterns/ or classification/mod.rs remain
clear.

In `@benches/classification.rs`:
- Around line 84-102: In bench_batch_classification replace the mixed-format
string "Error %s at line {}" so it matches the classifier's documented patterns
(use % specifiers or numeric {digits}); update the format! call that populates
samples (inside the loop that builds samples) to use a numeric placeholder like
"Error %s at line {0}" (still passing index), so the sample vector contains the
intended %-style pattern and the classifier.classify calls
(classifier.classify(..., &context)) will see the correct mix of formats.

In `@docs/src/classification.md`:
- Around line 13-17: The inline code span for the URL regex under the "### URLs"
section contains an unescaped backtick and a stray trailing backslash which
breaks rendering; update the code span for the pattern (the string starting with
`https?://[^\s<>"{}|\\\^\[\]\`\]+\`) to use a double-backtick code span and
remove the trailing backslash so the backtick inside the character class is
preserved and the pattern renders correctly.
♻️ Duplicate comments (7)
.goreleaser.yaml (1)

11-16: Pin cargo-zigbuild in the pre-build hook.
cargo install without --version makes release builds non-deterministic and can drift from the toolchain pinned in mise.toml. Align the hook to a fixed version.

Proposed fix
-    - cargo install --locked cargo-zigbuild
+    - cargo install --locked --version 0.21.2 cargo-zigbuild
#!/bin/bash
# Confirm the pinned cargo-zigbuild version in mise.toml
rg -n '"cargo:cargo-zigbuild"' mise.toml
mise.toml (2)

9-9: Use stable cargo-nextest instead of a pre-release.
Pinning a pre-release increases instability, and the pinned version may not even be published. Prefer the stable release.

Proposed fix
-"cargo:cargo-nextest"       = "0.9.123-b.4"
+"cargo:cargo-nextest"       = "0.9.123"
What is the latest stable cargo-nextest version, and does 0.9.123-b.4 exist?

18-18: Pin claude to a fixed version for reproducible toolchains.
Using "latest" makes local and CI environments drift over time.

Proposed fix
-claude                      = "latest"
+claude                      = "X.Y.Z"
What is the latest stable version string for the claude CLI (mise tool)?
src/classification/mod.rs (2)

40-76: Use lazy_static! for regex caching in classification modules.
This module uses once_cell::sync::Lazy, but the project rules require lazy_static! for regex caches. Convert the statics and ensure lazy_static is present in Cargo.toml (remove once_cell here if no longer needed). As per coding guidelines, align classification regex caching to lazy_static.

Proposed fix
-use once_cell::sync::Lazy;
+use lazy_static::lazy_static;
 use regex::Regex;
 
-static GUID_REGEX: Lazy<Regex> = Lazy::new(|| {
-    Regex::new(r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$")
-        .expect("Invalid GUID regex")
-});
-
-static EMAIL_REGEX: Lazy<Regex> = Lazy::new(|| {
-    Regex::new(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$").expect("Invalid email regex")
-});
-
-static BASE64_REGEX: Lazy<Regex> =
-    Lazy::new(|| Regex::new(r"^[A-Za-z0-9+/]{20,}={0,2}$").expect("Invalid base64 regex"));
-
-static FORMAT_REGEX: Lazy<Regex> =
-    Lazy::new(|| Regex::new(r"%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}").expect("Invalid format regex"));
-
-static USER_AGENT_REGEX: Lazy<Regex> = Lazy::new(|| {
-    Regex::new(r"(Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+)")
-        .expect("Invalid user agent regex")
-});
+lazy_static! {
+    static ref GUID_REGEX: Regex = Regex::new(
+        r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$"
+    )
+    .expect("Invalid GUID regex");
+
+    static ref EMAIL_REGEX: Regex =
+        Regex::new(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
+            .expect("Invalid email regex");
+
+    static ref BASE64_REGEX: Regex =
+        Regex::new(r"^[A-Za-z0-9+/]{20,}={0,2}$").expect("Invalid base64 regex");
+
+    static ref FORMAT_REGEX: Regex =
+        Regex::new(r"%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}").expect("Invalid format regex");
+
+    static ref USER_AGENT_REGEX: Regex = Regex::new(
+        r"(Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+)"
+    )
+    .expect("Invalid user agent regex");
+}
#!/bin/bash
# Identify lazy_static vs once_cell usage in classification modules
rg -n "lazy_static!|once_cell::sync::Lazy" src/classification

78-80: Mark SemanticClassifier as #[non_exhaustive].
This is a public struct; add #[non_exhaustive] to keep the API forward-compatible. As per coding guidelines, public structs should be non-exhaustive.

Proposed fix
+#[non_exhaustive]
 #[derive(Debug, Default)]
 pub struct SemanticClassifier;
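
A quick sketch of what the attribute buys (the `new()` constructor mirrors the API the tests already call):

```rust
// With #[non_exhaustive], code outside the defining crate can no
// longer use the unit-struct literal `SemanticClassifier`, so the
// struct can grow fields later without a semver-breaking change.
#[non_exhaustive]
#[derive(Debug, Default)]
pub struct SemanticClassifier;

impl SemanticClassifier {
    pub fn new() -> Self {
        Self
    }
}

fn main() {
    // Downstream callers go through a constructor instead.
    let _classifier = SemanticClassifier::new();
}
```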
tests/classification_integration_tests.rs (1)

23-112: Add insta snapshots for fixture-driven classification tests.
These tests load fixtures but only assert individual tags. The test guidelines require snapshot coverage for fixture-driven outputs to prevent regressions. As per coding guidelines, add insta snapshots for the tag sets.

Proposed fix (example for ELF test)
+use insta::assert_debug_snapshot;
@@
     let guid = "{12345678-1234-1234-1234-123456789abc}";
     let tags = classifier.classify(guid, &context);
     assert!(tags.contains(&Tag::Guid));
+    assert_debug_snapshot!("elf_guid_tags", tags);
docs/src/classification.md (1)

124-133: Fix integration test path reference.

The integration test file is tests/classification_integration.rs, not tests/classification_integration_tests.rs.

Proposed fix
-- Integration tests: `tests/classification_integration_tests.rs`
+- Integration tests: `tests/classification_integration.rs`

Comment on lines 88 to 91
## Adding Features

**New semantic tag**: Add variant to `Tag` enum in `types.rs`, implement pattern in `classification/semantic.rs`
**New semantic tag**: Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`


⚠️ Potential issue | 🟡 Minor

Fix MD036 by using a heading instead of bold text.
Markdownlint reports emphasis used as a heading. Convert the bold label to a heading or list item.

Proposed fix
-**New semantic tag**: Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
+### New semantic tag
+Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## Adding Features
**New semantic tag**: Add variant to `Tag` enum in `types.rs`, implement pattern in `classification/semantic.rs`
**New semantic tag**: Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
## Adding Features
### New semantic tag
Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
🤖 Prompt for AI Agents
In `@AGENTS.md` around lines 88 - 91, Replace the bold label "**New semantic
tag**:" with a proper Markdown heading (e.g., "### New semantic tag") or a list
item to satisfy MD036; keep the rest of the sentence ("Add variant to `Tag` enum
in `types/mod.rs`, implement pattern in `classification/patterns/` or
`classification/mod.rs`") unchanged and ensure the heading level aligns with
surrounding headings in AGENTS.md so references to the Tag enum (`Tag`),
types/mod.rs, and classification/patterns/ or classification/mod.rs remain
clear.

Comment on lines +84 to +102
fn bench_batch_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();

    let mut samples = Vec::new();
    for index in 0..1000 {
        samples.push(format!("{{12345678-1234-1234-1234-{:012x}}}", index));
        samples.push(format!("user{}@example.com", index));
        samples.push(format!("Error %s at line {}", index));
    }

    c.bench_function("classification_batch", |b| {
        b.iter(|| {
            for sample in &samples {
                let _ = classifier.classify(black_box(sample.as_str()), &context);
            }
        });
    });
}

⚠️ Potential issue | 🟡 Minor

Batch sample misses the intended brace-style format-string case.

In `format!("Error %s at line {}", index)` the `{}` placeholder is consumed by `format!` itself, and a literal `{}` would not match the documented brace pattern (`{digits}`) anyway, so only the `%s` specifier matches. This reduces the intended mix of categories in the batch set.

Proposed fix
-        samples.push(format!("Error %s at line {}", index));
+        samples.push(format!("Error %s at line %d {}", index));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fn bench_batch_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();

    let mut samples = Vec::new();
    for index in 0..1000 {
        samples.push(format!("{{12345678-1234-1234-1234-{:012x}}}", index));
        samples.push(format!("user{}@example.com", index));
        samples.push(format!("Error %s at line {}", index));
    }

    c.bench_function("classification_batch", |b| {
        b.iter(|| {
            for sample in &samples {
                let _ = classifier.classify(black_box(sample.as_str()), &context);
            }
        });
    });
}

fn bench_batch_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();

    let mut samples = Vec::new();
    for index in 0..1000 {
        samples.push(format!("{{12345678-1234-1234-1234-{:012x}}}", index));
        samples.push(format!("user{}@example.com", index));
        samples.push(format!("Error %s at line %d {}", index));
    }

    c.bench_function("classification_batch", |b| {
        b.iter(|| {
            for sample in &samples {
                let _ = classifier.classify(black_box(sample.as_str()), &context);
            }
        });
    });
}
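
A quick check against the documented pattern (regex copied from `src/classification/mod.rs`) shows which placeholders actually match:

```rust
use regex::Regex;

fn main() {
    let format_re = Regex::new(r"%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}").unwrap();

    // The %s specifier alone already matches the batch sample...
    assert!(format_re.is_match("Error %s at line 42"));
    // ...but the brace branch only accepts numbered placeholders,
    // so a literal `{}` never exercises it while `{0}` does.
    assert!(!format_re.is_match("at line {}"));
    assert!(format_re.is_match("at line {0}"));
}
```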
🤖 Prompt for AI Agents
In `@benches/classification.rs` around lines 84 - 102, In
bench_batch_classification replace the mixed-format string "Error %s at line {}"
so it matches the classifier's documented patterns (use % specifiers or numeric
{digits}); update the format! call that populates samples (inside the loop that
builds samples) to use a numeric placeholder like "Error %s at line {0}" (still
passing index), so the sample vector contains the intended %-style pattern and
the classifier.classify calls (classifier.classify(..., &context)) will see the
correct mix of formats.

Comment on lines +13 to +17
### URLs

#### URLs
- Pattern: `https?://[^\s<>"{}|\\\^\[\]\`\]+\`
- Examples: `https://example.com/path`, `http://malware.site/payload`
- Validation: Must start with `http://` or `https://`

⚠️ Potential issue | 🟡 Minor

URL pattern code span breaks on backtick.

The inline code span contains a backtick in the character class, which terminates the code span and renders incorrectly. Use a double-backtick code span and remove the stray trailing backslash.

Proposed fix
-- Pattern: `https?://[^\s<>"{}|\\\^\[\]\`\]+\`
+- Pattern: ``https?://[^\s<>"{}|\\\^\[\]\`]+``
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
### URLs
#### URLs
- Pattern: `https?://[^\s<>"{}|\\\^\[\]\`\]+\`
- Examples: `https://example.com/path`, `http://malware.site/payload`
- Validation: Must start with `http://` or `https://`
### URLs
- Pattern: ``https?://[^\s<>"{}|\\\^\[\]\`]+``
- Examples: `https://example.com/path`, `http://malware.site/payload`
- Validation: Must start with `http://` or `https://`
🤖 Prompt for AI Agents
In `@docs/src/classification.md` around lines 13 - 17, The inline code span for
the URL regex under the "### URLs" section contains an unescaped backtick and a
stray trailing backslash which breaks rendering; update the code span for the
pattern (the string starting with `https?://[^\s<>"{}|\\\^\[\]\`\]+\`) to use a
double-backtick code span and remove the trailing backslash so the backtick
inside the character class is preserved and the pattern renders correctly.

@unclesp1d3r unclesp1d3r merged commit e118ffa into main Jan 26, 2026
22 checks passed
@unclesp1d3r unclesp1d3r deleted the 18-implement-semantic-pattern-matching-guid-email-base64-printf-format-and-user-agent-detection branch January 26, 2026 04:22

Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)


Development

Successfully merging this pull request may close these issues.

Implement Semantic Pattern Matching: GUID, Email, Base64, Printf Format, and User Agent Detection
