
Conversation

@unclesp1d3r
Member

This pull request introduces a new benchmarking suite for the string classification system, updates documentation to reflect recent classifier changes, and improves build, dependency, and release tooling. The most notable changes are the addition of classification benchmarks, a major rewrite of the classification documentation, and enhanced GoReleaser and toolchain configuration.

Classification System Improvements:

  • Added a new benchmark suite benches/classification.rs to measure the performance of various string classification tasks, including GUIDs, emails, base64, format strings, user agents, batch classification, and worst-case scenarios.
  • Updated the domain classifier to avoid double-tagging domains that are actually email addresses.

Documentation Updates:

  • Completely rewrote docs/src/classification.md to document the new classification pipeline, validation rules, semantic categories, performance notes, and testing instructions. The documentation now matches the current implementation and usage patterns.

Build & Toolchain Enhancements:

  • Added a .goreleaser.yaml configuration for cross-platform Rust builds, including pre-build hooks for Rust toolchain setup and cross-compilation with cargo-zigbuild.
  • Modernized the justfile GoReleaser and lint tasks, removing platform-specific duplication and switching to markdownlint-cli2.
  • Expanded the mise.toml toolchain with additional Rust and documentation tools, and updated several tool versions for improved reproducibility.

Internal Code Cleanup:

  • Removed unused regex re-exports from src/classification/patterns/mod.rs that are no longer needed for classifier cache testing.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Updated the `SemanticClassifier` to utilize a new `StringContext` struct for context-aware classification.
- Enhanced classification capabilities for GUIDs, emails, Base64 strings, format strings, and user agents.
- Improved regex patterns and validation logic to reduce false positives.
- Consolidated path detection logic and removed redundant regex patterns.
- Updated tests to cover new classification scenarios and ensure performance benchmarks.
- Refactored integration tests to validate classification against various binary formats (ELF, PE, Mach-O).
- Removed deprecated methods and cleaned up code for better readability and maintainability.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@unclesp1d3r unclesp1d3r self-assigned this Jan 25, 2026
Copilot AI review requested due to automatic review settings January 25, 2026 06:07
@coderabbitai
Contributor

coderabbitai bot commented Jan 25, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Architectural & API Refactoring

This PR restructures the semantic classification system into a context-aware model and consolidates classification logic into src/classification/mod.rs (deleting the old semantic.rs). The classifier API moves away from a FoundString wrapper and toward a lightweight context-driven signature: SemanticClassifier::classify(&self, text: &str, context: &StringContext) plus a backwards-compatible classify_found_string(&self, &FoundString). apply_semantic_enrichment now accepts &ContainerInfo to construct and pass StringContext to the classifier.

Key Public API Changes

  • Added StringContext (non_exhaustive) with: section_type, optional section_name, binary_format, encoding, source; constructor StringContext::new(...) and with_section_name(...)
  • SemanticClassifier is now the primary entry in src/classification/mod.rs (a usage sketch follows this list) with:
    • pub fn classify(&self, text: &str, context: &StringContext) -> Vec<Tag>
    • pub fn classify_found_string(&self, found: &crate::types::FoundString) -> Vec<Tag>
    • pub fn regex_cache_addresses(&self) -> RegexCacheAddresses (for testing)
  • New (doc(hidden) pub) RegexCacheAddresses to expose internal regex cache addresses for tests
  • src/types/mod.rs exposes StringContext and re-exports it from lib.rs
  • Removed semantic.rs (old classifier implementation) and many of its public re-exports
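
A minimal sketch of the new call shape. Import paths, the constructor's parameter order, and the StringSource variant are assumptions; the walkthrough only guarantees that StringContext::new(...) and with_section_name(...) exist:

```rust
use stringy::classification::SemanticClassifier;
use stringy::types::{BinaryFormat, Encoding, SectionType, StringContext, StringSource, Tag};

fn classify_one() {
    let classifier = SemanticClassifier::new();
    // StringContext is #[non_exhaustive], so external callers go through the
    // constructor rather than a struct literal. Parameter order is assumed here.
    let context = StringContext::new(
        SectionType::StringData,
        BinaryFormat::Elf,
        Encoding::Ascii,
        StringSource::Section, // hypothetical variant name
    )
    .with_section_name(".rodata".to_string());

    let tags = classifier.classify("{12345678-1234-1234-1234-123456789abc}", &context);
    assert!(tags.contains(&Tag::Guid));
}
```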

Binary Format & Section Handling

  • Classification decisions are now context-aware: section_type and binary_format (ELF/PE/Mach-O) are used to adjust validation/boosting rules (e.g., .rodata vs other sections).
  • apply_semantic_enrichment constructs a section_map from ContainerInfo, determines SectionType for each FoundString, builds a StringContext (attaching section_name when available), and passes the text + context to the classifier.
  • This enables more precise detection (e.g., GUIDs, format strings) when the binary section indicates string literal regions; a rough sketch of the boosting idea follows.
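
The summary does not spell out the boosting mechanics; the sketch below only illustrates the idea. The section names and the numeric factor are invented for illustration, not taken from the diff:

```rust
// Illustrative only; the real rules live in src/classification/mod.rs.
fn literal_section_boost(context: &StringContext) -> f64 {
    match context.section_name.as_deref() {
        // Matches inside known string-literal regions are more credible.
        Some(".rodata") | Some(".rdata") | Some("__cstring") => 1.25, // assumed factor
        _ => 1.0,
    }
}
```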

Semantic Classification Accuracy & Validation

  • New semantic pattern matching for GUID, Email, Base64, Format String, and User-Agent with explicit validation helpers (is_valid_guid, is_valid_email, is_valid_base64, is_valid_format_string, is_valid_user_agent).
  • Classification pipeline now separates pattern matching from validation, includes context-aware confidence boosting, length checks, entropy heuristics, and TLD validation for domains.
  • classify_domain now avoids double-tagging email-like inputs by early-exiting if the input contains '@' (sketched after this list).
  • Added Tag::Guid; the semantic categories Guid, Email, Base64, FormatString, and UserAgent are now explicitly documented and used, while URL/Domain/IP/FilePath/Registry classification is retained (with enhanced validation).
  • Backwards-compatible classify_found_string helper preserves older callsites where needed.
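
The early-exit described above amounts to something like the following; the return type and the elided validation are assumptions based on the tests quoted later in this thread:

```rust
fn classify_domain(text: &str) -> Option<Tag> {
    // Leave email-like inputs to the email classifier so that
    // "user@example.com" is not double-tagged as a domain.
    if text.contains('@') {
        return None;
    }
    // ... domain regex match, TLD validation, etc. (elided) ...
    None
}
```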

Performance & Reliability

  • Introduces lazy, once_cell-based regex caching for GUID, Email, Base64, FormatString, and UserAgent, exposed via regex_cache_addresses for testing (see the sketch after this list).
  • Replaced unwrap() with explicit expect(...) messages across regex initializers (data.rs, ip.rs, network.rs, paths.rs) for clearer panic messages on invalid regexes.
  • New Criterion benchmarking suite benches/classification.rs covering classifier construction, per-pattern workloads (GUID, email, base64, format strings, user agents), batch classification (1000 samples), worst-case inputs, and context creation overhead.
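
Note the walkthrough says once_cell while the diff quoted later in this thread uses lazy_static. In the once_cell style used by the existing pattern modules, the cached GUID regex would look roughly like this (pattern taken from the quoted diff):

```rust
use once_cell::sync::Lazy;
use regex::Regex;

// Compiled once on first access; every later call reuses the cached Regex.
static GUID_REGEX: Lazy<Regex> = Lazy::new(|| {
    Regex::new(
        r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$",
    )
    .expect("Invalid GUID regex")
});
```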

Tooling, Docs & Build

  • Docs: Full rewrite of docs/src/classification.md to describe the new pipeline, validation rules, semantic categories, performance notes, and testing instructions.
  • Release/build: Added .goreleaser.yaml with pre-build hooks and cargo-zigbuild cross-compilation; updated justfile to reduce duplication and use markdownlint-cli2.
  • Development tooling: mise.toml expanded with numerous cargo:* and other tool pins (cargo-audit, cargo-deny, cargo-zigbuild, mdbook plugins, etc.).
  • Tests: New and updated tests and integration suites (tests/classification_tests.rs and tests/classification_integration_tests.rs) exercising GUID/Email/Base64/FormatString/UserAgent detection and context-sensitive behavior; existing tests updated to use StringContext.

Removals & Breaking Changes

  • Deleted src/classification/semantic.rs and removed older FoundString-first classifier implementation.
  • Classifier signature changed (breaking for callers): classify(&FoundString) → classify(&str, &StringContext). apply_semantic_enrichment signature updated to accept &ContainerInfo.
  • Patterns module: removed certain regex re-exports from src/classification/patterns/mod.rs (no longer publicly re-exported).
  • StringyError and Result moved into a dedicated src/types/error.rs (new error enum and From impls); types/mod.rs removed previous error definitions.
  • Consumers must construct and pass StringContext when calling classify; tests and extraction call sites updated accordingly.

Notes on Review Focus

  • Architectural impact: consolidation of classifier surface and movement to context-aware API; ensure no regressions from deleting semantic.rs and that classify_found_string remains compatible where needed.
  • API compatibility: callers must be updated to construct StringContext; confirm public re-exports in src/lib.rs expose the new types as intended.
  • Binary/section semantics: verify section_type mapping from ContainerInfo is correct for all extractor code paths and that boosting/validation heuristics are guarded to avoid false positives.
  • Classification correctness: review new validation logic (length, entropy, TLD checks) and regex patterns for GUID/email/base64/format/user-agent to ensure sensitivity/specificity trade-offs are acceptable.

Walkthrough

This PR rewrites and consolidates the string classification subsystem into a context-aware API. The classifier now accepts (text: &str, context: &StringContext), regex initializers use explicit expect messages, many pattern modules and tests are reorganized, and build/tooling/benchmarking infra (goreleaser, benches, mise, justfile) is added or updated.

Changes

  • Core classification (src/classification/mod.rs, src/classification/semantic.rs): semantic.rs removed; classification logic consolidated into mod.rs. New public SemanticClassifier with fn classify(&self, text: &str, context: &StringContext) -> Vec<Tag> plus a backward-compatible classify_found_string. Adds a regex_cache_addresses() accessor and RegexCacheAddresses (doc(hidden)).
  • Patterns & regexes (src/classification/patterns/*, e.g. mod.rs, network.rs, paths.rs, data.rs, ip.rs): Removed some re-exports of regex constants; replaced .unwrap() with .expect("Invalid ... regex") on Lazy regex inits; the network.rs domain classifier early-exits on email-like input and uses expect; paths.rs drops the REGISTRY_PATH_REGEX re-export. New/updated GUID/EMAIL/BASE64/FORMAT/USER_AGENT regexes live under patterns.
  • Types / context / errors (src/types/mod.rs, src/types/error.rs, src/types/tests.rs): New public StringContext (section_type, section_name, binary_format, encoding, source) with new() and with_section_name(). StringyError and the Result<T> alias moved to a new src/types/error.rs module. Added FoundString serde unit tests.
  • Extraction integration (src/extraction/mod.rs, src/lib.rs): apply_semantic_enrichment signature changed to accept &ContainerInfo; builds a section_map from container segments and constructs a StringContext per FoundString; StringContext re-exported from the crate root.
  • Build & benchmark infra (.goreleaser.yaml, Cargo.toml, benches/classification.rs): Adds .goreleaser.yaml (v2) with zig-based cross-builds and packaging rules; updates dependencies (entropy, thiserror); adds a classification bench target and introduces benches/classification.rs with multiple Criterion benchmarks.
  • Tooling / task runner (justfile, mise.toml): justfile swaps markdownlint for markdownlint-cli2, removes several goreleaser recipes, and adds an annotated goreleaser-build-target arg. mise.toml reorganizes tools under cargo:* and adds many dev tools (cargo-audit, cargo-deny, cargo-zigbuild, mdbook plugins, zig, lychee, pipx:mdformat).
  • Tests & docs (docs/src/classification.md, tests/classification*.rs, tests/classification_integration*.rs): Docs refocused toward GUID-centric examples and a validation step; tests updated and added to use StringContext and the new classification API. New unit and integration test suites cover GUID, Email, Base64, FormatString, UserAgent, and path/URL cases plus batch/perf checks.

Sequence Diagram(s)

(omitted — changes are internal library refactors and tests; no multi-actor runtime sequence diagram applicable)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🧩 Context arrives, classifiers awake,
Regexes hardened, expectations take stake,
GUIDs and emails now seen with a view,
Benchmarks hum, toolchains refreshed too —
Small refactors, big clarity; ship it true.

🚥 Pre-merge checks | ✅ 6 | ❌ 2
❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Error Handling | ⚠️ Warning | StringyError is defined but never used; functions panic via .expect() instead of returning Result<T, StringyError>, violating the stated error handling conventions. | Update all fallible operations to return Result<T, StringyError> and replace .expect() calls with proper error propagation using the ? operator throughout the classification and extraction modules. |
| Ascii Only | ⚠️ Warning | The PR introduces non-ASCII emojis (✅, 🚧) in src/lib.rs documentation comments, violating the ASCII-only requirement stated in AGENTS.md. | Remove the emoji status indicators from src/lib.rs lines 57-67 and replace them with ASCII equivalents such as [OK] or [WIP]. |

✅ Passed checks (6 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title follows the Conventional Commits format with a feat type and classification scope, clearly describing the semantic pattern matching features being added. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering classification improvements, documentation updates, build enhancements, and internal cleanup. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 92.86%, which is sufficient. The required threshold is 70.00%. |
| No Unsafe Code | ✅ Passed | No unsafe code found. The crate maintains #![forbid(unsafe_code)] at the lib.rs top level, with zero unsafe blocks or #[allow(unsafe_code)] attributes introduced. |
| File Size Limit | ✅ Passed | All 8 new files comply with the 500-line limit (max 247 lines), no blanket #[allow] attributes were found, and the pre-existing oversized file was minimally modified (+33/-5). |
| Section Weight Consistency | ✅ Passed | The custom check for section weight consistency does not apply; no container parser files were modified. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches

  • 📝 Generate docstrings

🧪 Generate unit tests (beta)

  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch 18-implement-semantic-pattern-matching-guid-email-base64-printf-format-and-user-agent-detection

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels Jan 25, 2026
@codecov
Copy link

codecov bot commented Jan 25, 2026

Codecov Report

❌ Patch coverage is 92.33227% with 24 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/classification/mod.rs | 93.67% | 15 Missing ⚠️ |
| src/types/error.rs | 0.00% | 9 Missing ⚠️ |


Copilot AI left a comment

Pull request overview

This PR expands Stringy’s semantic classification capabilities and tooling by introducing context-aware classification (via StringContext), adding new tag detectors (GUID/email/base64/format strings/user agents), and improving benchmarking/testing plus release/toolchain configs.

Changes:

  • Added StringContext and rewired semantic enrichment/classification to be context-aware across extraction and tests.
  • Implemented additional semantic detectors (GUID/email/base64/format strings/user agents) and adjusted domain tagging to avoid email double-tagging.
  • Added Criterion benchmarks and updated build/release/toolchain/docs to reflect the new classification pipeline.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 10 comments.

Show a summary per file

| File | Description |
| --- | --- |
| src/types.rs | Adds StringContext to carry classification context metadata. |
| src/lib.rs | Re-exports StringContext as part of the public API. |
| src/extraction/mod.rs | Passes container-derived context into semantic enrichment/classification. |
| src/classification/mod.rs | Reworks SemanticClassifier into a context-aware classifier and adds new detectors. |
| src/classification/patterns/network.rs | Prevents domain tagging when input looks like an email address. |
| src/classification/patterns/mod.rs | Removes regex re-exports no longer needed for cache testing. |
| src/classification/patterns/paths.rs | Removes REGISTRY_PATH_REGEX and relies on validation logic for registry detection. |
| src/classification/semantic.rs | Updates imports, but now appears out-of-sync with the active classifier/module tree. |
| tests/classification_tests.rs | Adds unit tests for new detectors and context-aware behavior. |
| tests/classification_integration_tests.rs | Adds integration coverage across ELF/PE/Mach-O fixtures plus perf/cache checks. |
| tests/classification_integration.rs | Updates snapshot/perf tests to match the new classifier API and tags. |
| tests/snapshots/classification_integration__classification_snapshots.snap | Updates expected snapshot outputs for new tags. |
| benches/classification.rs | Adds Criterion benchmark suite for classification workloads. |
| Cargo.toml | Adds benchmark target and new dependency (lazy_static); bumps some versions. |
| docs/src/classification.md | Rewrites classification documentation for the new pipeline and tags. |
| justfile | Updates lint/docs tooling invocation and simplifies GoReleaser tasks. |
| mise.toml | Expands toolchain tools/versions and adds more build/documentation tooling. |
| .goreleaser.yaml | Adds GoReleaser config for cross-platform Rust builds and hooks for tool setup. |

Comment on lines +103 to +106
#[must_use]
pub fn classify(&self, text: &str, context: &StringContext) -> Vec<Tag> {
let mut tags = Vec::new();

Copilot AI Jan 25, 2026

SemanticClassifier::classify has been changed to require (&str, &StringContext), which is a breaking change for any downstream code that previously called classify(&FoundString). If this is intended to be backward-compatible, consider restoring the old classify(&FoundString) signature as a thin wrapper (and optionally deprecate it) while introducing a new name for the context-aware API (e.g., classify_with_context).

Comment on lines 40 to 62
use lazy_static::lazy_static;
use regex::Regex;

use crate::types::{BinaryFormat, SectionType, StringContext, StringSource, Tag};

pub mod patterns;
pub mod ranking;
pub mod semantic;
pub mod symbols;

pub use ranking::{RankingConfig, RankingEngine};
pub use semantic::SemanticClassifier;
pub use symbols::SymbolDemangler;

// Import pattern classification functions
use patterns::{
    classify_domain, classify_ip_addresses, classify_posix_path, classify_registry_path,
    classify_unc_path, classify_url, classify_windows_path,
};

lazy_static! {
    static ref GUID_REGEX: Regex = Regex::new(
        r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$",
    )
    .expect("Invalid GUID regex");
Copilot AI Jan 25, 2026

This module introduces lazy_static for regex caching, but the rest of the codebase uses once_cell::sync::Lazy for the same purpose (e.g., src/classification/patterns/network.rs:6-22, src/classification/patterns/paths.rs:6-24). Consider switching these regex statics to once_cell::sync::Lazy and dropping the extra lazy_static dependency to keep conventions and dependencies consistent.

Comment on lines 11 to 41
New text:

## Semantic Categories

### GUIDs

- Pattern: `\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}`
- Examples: `{12345678-1234-1234-1234-123456789abc}`
- Validation: Strict format compliance with braces required

### Email Addresses

- Pattern: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- Examples: `admin@malware.com`, `user.name+tag@example.co.uk`
- Validation: Single `@`, valid TLD length and characters, no empty parts

### Base64 Data

- Pattern: `[A-Za-z0-9+/]{20,}={0,2}`
- Examples: `U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw==`
- Validation: Length >= 20, length divisible by 4, padding rules, entropy threshold

### Format Strings

- Pattern: `%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}`
- Examples: `Error: %s at line %d`, `User {0} logged in`
- Validation: Reasonable specifier count, context-aware thresholds

### User Agents

- Pattern: `Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+`
- Examples: `Mozilla/5.0 (Windows NT 10.0; Win64; x64)`, `Chrome/117.0.5938.92`
- Validation: Known browser identifiers and minimum length

## Tag Specificity

Tags are treated as either specific or broad. Specific tags indicate high confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag and should be treated as ambiguous due to higher false positive risk.

Removed text:

### Network Indicators

#### URLs

- **Pattern**: `` https?://[^\s<>"{}|\\^\[\]\`]+ ``
- **Examples**: `https://api.example.com/v1/users`, `http://malware.com/payload`
- **Validation**: URL format check with safe character filtering
- **Security relevance**: High - indicates network communication

#### Domain Names

- **Pattern**: `\b(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}\b`
- **Examples**: `api.example.com`, `malware-c2.net`
- **Validation**: TLD checking, DNS format compliance
- **Security relevance**: High - C2 domains, legitimate services

#### IP Addresses

- **IPv4 Pattern**: `\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b`
- **IPv6 Pattern**: Comprehensive pattern supporting full notation, compressed notation (`::1`), and mixed notation (`::ffff:192.0.2.1`)
- **Examples**: `192.168.1.1`, `2001:db8::1`, `[::1]:8080`
- **Validation**: Two-stage validation using regex pre-filter followed by `std::net::IpAddr` parsing for correctness
- **Port Handling**: IP addresses with ports (e.g., `192.168.1.1:8080`) are supported by automatically stripping the port suffix before validation
- **IPv6 Bracket Handling**: Bracketed IPv6 addresses (e.g., `[::1]` and `[::1]:8080`) are supported
- **False Positive Mitigation**: Version numbers like `1.2.3.4` are accepted as IPv4 addresses by design
- **Implementation**: See `src/classification/semantic.rs` for the complete implementation
- **Security relevance**: High - infrastructure indicators

### File System Indicators

#### File Paths

- **POSIX Pattern**: `^/[^\0\n\r]*`
- **Windows Pattern**: `^[A-Za-z]:\\[^\0\n\r]*`
- **UNC Pattern**: `^\\\\[a-zA-Z0-9.-]+\\[^\0\n\r]*`
- **Examples**: `/usr/bin/malware`, `C:\\Windows\\System32\\evil.dll`, `\\\\server\\share\\file.txt`
- **Validation rules**: Rejects null bytes, newlines, carriage returns; rejects consecutive path separators in POSIX paths (`//`) and consecutive backslashes in Windows paths (for example, `folder\\\\file.txt`), while allowing UNC paths that start with `\\\\`; applies a reasonable length limit (4096 max, stricter for unknown prefixes); POSIX paths must be absolute (start with `/`); Windows paths must use backslashes and a valid drive letter
- **Suspicious path examples**: `/etc/cron.d/`, `/etc/init.d/`, `/usr/local/bin/`, `/tmp/`, `/var/tmp/`; `C:\\Windows\\System32\\`, `C:\\Windows\\Temp\\`, `...\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Startup\\`
- **Security relevance**: Medium-High - persistence and execution locations

#### Registry Paths

- **Full root pattern**: `^HKEY_[A-Z_]+\\[^\0\n\r]*`
- **Abbreviated root pattern**: `^HK(LM|CU|CR|U|CC)\\[^\0\n\r]*`
- **Supported root keys**:
  - `HKEY_LOCAL_MACHINE`
  - `HKEY_CURRENT_USER`
  - `HKEY_CLASSES_ROOT`
  - `HKEY_USERS`
  - `HKEY_CURRENT_CONFIG`
- **Supported abbreviations**:
  - `HKLM`, `HKCU`, `HKCR`, `HKU`, `HKCC`
- **Suspicious registry paths**:
  - `\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run`
  - `\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\RunOnce`
  - `\\System\\CurrentControlSet\\Services`
  - `\\SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\Winlogon`
- **Examples**:
  - `HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run`
  - `HKCU\\Software\\Microsoft`
- **Security relevance**: High - persistence mechanisms

### Identifiers

#### GUIDs/UUIDs

- **Pattern**: `\{?[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}?`
- **Examples**: `{12345678-1234-1234-1234-123456789abc}`, `12345678-1234-1234-1234-123456789abc`
- **Validation**: Format compliance
- **Security relevance**: Medium - component identification

#### Email Addresses

- **Pattern**: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- **Examples**: `admin@malware.com`, `support@legitimate.org`
- **Validation**: Basic format validation
- **Security relevance**: Medium - contact information

### Code Artifacts

#### Format Strings

- **Pattern**: `%[-+0 #]*(\d+|\*)?(\.(\d+|\*))?(hh?|ll?|[Lzjt])?[diouxXeEfFgGaAcspn%]`
- **Examples**: `Error: %s at line %d`, `Name: %s, Age: %d, Score: %.2f`
- **Context**: Presence of real format specifiers (%% alone is ignored)
- **Security relevance**: Low-Medium - debugging information

#### Base64 Data

- **Pattern**: Character set validation with padding rules
- **Examples**: `SGVsbG8gV29ybGQ=`
- **Validation**: Length >= 16, Base64 character set, valid padding, reject length mod 4 of 1
- **Security relevance**: Variable - encoded payloads

#### User Agents

- **Pattern**: Prefix match for common agents (Mozilla, curl, Wget, python-requests, libwww-perl, Java, Apache-HttpClient, okhttp, PostmanRuntime)
- **Examples**: `Mozilla/5.0 (Windows NT 10.0; Win64; x64)`, `curl/7.68.0`
- **Security relevance**: Medium - network fingerprinting
Copilot AI Jan 25, 2026

The rewritten documentation lists only GUID/email/base64/format-string/user-agent categories, but the classifier still supports URL/domain/IP/file-path/registry-path tagging (see src/classification/mod.rs capability list and implementation). Please update this doc to include the still-supported tag categories so it matches the current behavior.

Comment on lines 11 to 17
before:
  hooks:
    # if you don't do these things before calling goreleaser, it might be a
    # good idea to do them here:
    - rustup default stable
    - cargo install --locked cargo-zigbuild
    - cargo fetch --locked
Copilot AI Jan 25, 2026

The before.hooks steps have reproducibility/side-effect concerns: rustup default stable mutates the global toolchain (and the repo already pins a toolchain via rust-toolchain.toml), and cargo install --locked cargo-zigbuild does not pin a cargo-zigbuild version. Consider removing the rustup default step and pinning cargo-zigbuild with --version (or relying on a preinstalled/pinned tool) to make release builds deterministic.
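
A hedged sketch of the pinned form (the version string is a placeholder, not a tested recommendation):

```yaml
before:
  hooks:
    # Pin the cross-compiler so release builds are deterministic.
    # Replace X.Y.Z with a version you have actually validated.
    - cargo install --locked cargo-zigbuild --version X.Y.Z
    - cargo fetch --locked
```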

Comment on lines 20 to 23
git-cliff = "2.11.0"
goreleaser = "2.13.3"
just = "latest"
markdownlint-cli2 = "0.20.0"
Copilot AI Jan 25, 2026

just = "latest" reduces toolchain reproducibility compared to the previous pinned version, which can make CI/local behavior drift over time. If reproducible builds are a goal, consider pinning just to a specific version here (similar to other tools).

Comment on lines 22 to 30
 [dependencies]
 clap = { version = "4.5.54", features = [ "derive" ] }
 cpp_demangle = "0.5.1"
-entropy = "0.4.2"
+entropy = "0.4.3"
 goblin = "0.10.4"
+lazy_static = "1.5.0"
 once_cell = "1.21.3"
 pelite = "0.10.0"
 regex = "1.12.2"
Copilot AI Jan 25, 2026

lazy_static was added as a new dependency, but existing regex statics in this repo use once_cell::sync::Lazy. If you switch the new classifier regex statics to once_cell, you can likely remove lazy_static here to avoid an extra dependency and keep the project consistent.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/classification/patterns/network.rs (1)

114-141: Add test coverage for email exclusion from domain classification.

The existing tests cover URL exclusion but not the new email exclusion behavior.

🧪 Suggested test addition
 #[test]
 fn test_domain_valid_and_invalid() {
     assert!(classify_domain("example.com").is_some());
     assert!(classify_domain("https://example.com").is_none());
     assert!(classify_domain("invalid.xyz123").is_none());
 }

+#[test]
+fn test_email_not_tagged_as_domain() {
+    // Email addresses should not be tagged as domains
+    assert!(classify_domain("user@example.com").is_none());
+    assert!(classify_domain("admin@subdomain.example.org").is_none());
+}
src/lib.rs (1)

69-73: Add crate-level safety attributes to the library root.

The #![forbid(unsafe_code)] and #![deny(warnings)] attributes are missing from src/lib.rs. These must be present at the crate root before any module declarations. Add them immediately after the crate documentation block and before the pub mod statements:

#![forbid(unsafe_code)]
#![deny(warnings)]

pub mod classification;
tests/classification_integration.rs (1)

50-72: Avoid timing-based assertion in unit tests.
Line 72 enforces a 500ms budget, which is CI and load sensitive and can cause flaky failures. Gate this behind an env flag or move it to benchmarks.

Proposed gating to avoid flaky CI runs
 #[test]
 fn test_classification_performance() {
+    if std::env::var("STRINGY_PERF_TESTS").is_err() {
+        return;
+    }
     let classifier = SemanticClassifier::new();
     ...
     assert!(elapsed < Duration::from_millis(500));
 }
As per coding guidelines, avoid timing-dependent assertions in tests.
🤖 Fix all issues with AI agents
In `@.goreleaser.yaml`:
- Around line 30-38: The rust builder block (builder: rust with id: cross) is
missing the required command for cargo-zigbuild; update that block to add
"command: zigbuild" so GoReleaser uses cargo-zigbuild instead of the default
cargo build when cross-compiling (ensuring targets like x86_64-unknown-linux-gnu
and aarch64-unknown-linux-gnu are built correctly).

In `@Cargo.toml`:
- Around line 27-28: Inconsistency: some pattern modules use
once_cell::sync::Lazy while src/classification/mod.rs uses lazy_static!; convert
those pattern modules to use lazy_static! for uniform lazy initialization. For
each pattern module that currently declares statics like "static NAME:
Lazy<Type> = Lazy::new(|| ...)", replace the import once_cell::sync::Lazy with
"use lazy_static::lazy_static;" and change the declaration to "lazy_static! {
static ref NAME: Type = ...; }", preserving the initializer body and ensuring
types match (e.g., HashMap, Regex); remove once_cell imports and usages, keep
the existing lazy_static dependency in Cargo.toml, and mirror the same pattern
used in src/classification/mod.rs so all modules use lazy_static! consistently.

In `@docs/src/classification.md`:
- Around line 96-97: Update the documentation reference in
docs/src/classification.md: replace the incorrect integration test path
`tests/classification_integration_tests.rs` with the actual test file name
`tests/classification_integration.rs` so the docs point to the correct
integration test.

In `@mise.toml`:
- Line 10: The file defines mdbook twice; remove the duplicate backend entry so
only one key remains (either "cargo:mdbook" or "mdbook"). Locate the two keys
"cargo:mdbook" and "mdbook" in the mise.toml and delete the one you do not want
to use (prefer keeping "cargo:mdbook" if you need the cargo backend), ensuring
the remaining key retains its version string.
- Around line 18-22: The manifest uses "latest" for the claude and just entries
which breaks reproducible builds; update the claude and just keys to fixed
semantic versions (e.g., replace claude = "latest" and just = "latest" with
explicit version strings) so builds are deterministic, ensure the chosen
versions are compatible with the other listed tools (cyclonedx, git-cliff,
goreleaser), and run any dependency lock/update step your repo uses after
changing the versions to commit the resolved state.
- Line 9: The manifest pins a pre-release "cargo:cargo-nextest" = "0.9.123-b.4";
update it to the stable release by replacing the value with "0.9.123" (i.e.,
change the "cargo:cargo-nextest" entry to use the stable 0.9.123 version) to
avoid using an undocumented pre-release.

In `@src/classification/mod.rs`:
- Around line 75-86: Mark both public structs with #[non_exhaustive] to follow
the guideline: add #[non_exhaustive] above SemanticClassifier and
RegexCacheAddresses. For RegexCacheAddresses, make the fields private (remove
pub on guid, email, base64, format, user_agent) and add a public constructor pub
fn new(guid: usize, email: usize, base64: usize, format: usize, user_agent:
usize) -> Self that returns Self { guid, email, base64, format, user_agent },
keeping the existing derives; update any call sites to use
RegexCacheAddresses::new instead of struct literal field initialization.

In `@src/types.rs`:
- Around line 295-308: Add stability guarantees to the public API type
StringContext by marking the struct with #[non_exhaustive] and adding an
explicit constructor similar to ContainerInfo; specifically, annotate the
StringContext definition with #[non_exhaustive] and implement impl StringContext
{ pub fn new(section_type: SectionType, section_name: Option<String>,
binary_format: BinaryFormat, encoding: Encoding, source: StringSource) -> Self {
... } } so callers must use the constructor and future fields can be added
without breaking changes.

In `@tests/classification_integration_tests.rs`:
- Around line 15-27: The test helper create_test_context constructs
StringContext via a struct literal which will break if StringContext becomes
#[non_exhaustive]; change create_test_context to call the provided constructor
(e.g. StringContext::new) instead of using the literal, passing section_type,
Some(".rodata".to_string()) (or the constructor's section_name param),
binary_format, Encoding::Ascii, and source so the helper remains compatible with
future additions to StringContext.
- Around line 153-179: The performance test test_classification_performance
currently asserts elapsed < Duration::from_secs(2) for ~3600 samples which is
inconsistent with the other integration test's 500ms budget; update this test to
use the same timeout strategy (e.g., change the assertion to use
Duration::from_millis(500) or adjust the samples count to match that budget) or
add a brief inline comment near test_classification_performance explaining why
this test uses a longer 2s timeout for CI stability; ensure you touch the
Duration::from_secs(2) assertion and/or the samples construction and keep
references to SemanticClassifier::new(), classifier.classify(...), and
regex_cache_addresses() intact.

Cargo.toml Outdated
Comment on lines 27 to 28
lazy_static = "1.5.0"
once_cell = "1.21.3"
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check which lazy initialization pattern is used where
echo "=== lazy_static! usage ==="
rg -n 'lazy_static!' --type rust

echo ""
echo "=== once_cell::sync::Lazy usage ==="
rg -n 'once_cell::sync::Lazy' --type rust

echo ""
echo "=== once_cell re-exports (lazy_static or Lazy) in module level ==="
rg -n 'once_cell' --type rust | head -20

echo ""
echo "=== Cargo.toml dependencies around lines 25-35 ==="
sed -n '25,35p' Cargo.toml

Repository: EvilBit-Labs/Stringy

Length of output: 1115


Inconsistent lazy initialization patterns.

The codebase mixes two lazy initialization approaches: lazy_static! is used in src/classification/mod.rs, while once_cell::sync::Lazy is used across four pattern modules. Standardize on one approach for consistency.

Given that lazy_static is listed as a core dependency in the coding guidelines and is actively used, keep it. Migrate pattern modules from once_cell::sync::Lazy to lazy_static! to maintain uniform lazy initialization patterns.

🤖 Prompt for AI Agents
In `@Cargo.toml` around lines 27 - 28, Inconsistency: some pattern modules use
once_cell::sync::Lazy while src/classification/mod.rs uses lazy_static!; convert
those pattern modules to use lazy_static! for uniform lazy initialization. For
each pattern module that currently declares statics like "static NAME:
Lazy<Type> = Lazy::new(|| ...)", replace the import once_cell::sync::Lazy with
"use lazy_static::lazy_static;" and change the declaration to "lazy_static! {
static ref NAME: Type = ...; }", preserving the initializer body and ensuring
types match (e.g., HashMap, Regex); remove once_cell imports and usages, keep
the existing lazy_static dependency in Cargo.toml, and mirror the same pattern
used in src/classification/mod.rs so all modules use lazy_static! consistently.

Comment on lines +96 to +97
- Unit tests: `tests/classification_tests.rs`
- Integration tests: `tests/classification_integration_tests.rs`
Contributor

⚠️ Potential issue | 🟡 Minor

Fix integration test path in docs.
Line 97 references tests/classification_integration_tests.rs, but the test file in this PR is tests/classification_integration.rs.

Proposed doc fix
-- Integration tests: `tests/classification_integration_tests.rs`
+- Integration tests: `tests/classification_integration.rs`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- Unit tests: `tests/classification_tests.rs`
- Integration tests: `tests/classification_integration_tests.rs`
- Unit tests: `tests/classification_tests.rs`
- Integration tests: `tests/classification_integration.rs`
🤖 Prompt for AI Agents
In `@docs/src/classification.md` around lines 96 - 97, Update the documentation
reference in docs/src/classification.md: replace the incorrect integration test
path `tests/classification_integration_tests.rs` with the actual test file name
`tests/classification_integration.rs` so the docs point to the correct
integration test.

"cargo:cargo-deny" = "0.19.0"
"cargo:cargo-dist" = "0.30.3"
"cargo:cargo-llvm-cov" = "0.6.24"
"cargo:cargo-nextest" = "0.9.123-b.4"
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the mise.toml file
cat -n mise.toml | head -20

Repository: EvilBit-Labs/Stringy

Length of output: 952


🏁 Script executed:

# Search for cargo-nextest version information in the codebase
rg "cargo-nextest" -A 2 -B 2

Repository: EvilBit-Labs/Stringy

Length of output: 1101


🌐 Web query:

cargo-nextest 0.9.123 pre-release version

💡 Result:

cargo-nextest 0.9.123 had beta (pre-release) builds:

  • 0.9.123-b.1 (2026-01-18): introduces experimental record & replay test runs. [1]
  • 0.9.123-b.2 (2026-01-18): improvements to record-replay (storage + display). [1]
  • 0.9.123-b.3 (2026-01-21): adds rerun functionality on top of record-replay. [1]

If you want the pre-release binaries, use the “release URL” scheme and put the exact pre-release in {version} (for example 0.9.123-b.3):
https://get.nexte.st/{version}/{platform} [2]

Example (Linux x86_64):

curl -LsSf https://get.nexte.st/0.9.123-b.3/linux | tar zxf - -C ${CARGO_HOME:-~/.cargo}/bin

Note: the stable cargo-nextest 0.9.123 release was published on 2026-01-23. [3]

Sources: [1] [2] [3]


Use stable cargo-nextest 0.9.123 instead of pre-release

Line 9 pins a pre-release version (0.9.123-b.4). The stable 0.9.123 release is available (2026-01-23). For a binary analysis tool, prefer the stable release to avoid unnecessary instability. Additionally, only beta versions b.1–b.3 are documented; verify that 0.9.123-b.4 actually exists.

🤖 Prompt for AI Agents
In `@mise.toml` at line 9, The manifest pins a pre-release "cargo:cargo-nextest" =
"0.9.123-b.4"; update it to the stable release by replacing the value with
"0.9.123" (i.e., change the "cargo:cargo-nextest" entry to use the stable
0.9.123 version) to avoid using an undocumented pre-release.

"cargo:cargo-dist" = "0.30.3"
"cargo:cargo-llvm-cov" = "0.6.24"
"cargo:cargo-nextest" = "0.9.123-b.4"
"cargo:mdbook" = "0.5.2"
Contributor

🧹 Nitpick | 🔵 Trivial

Avoid duplicate mdbook definitions

Line 10 and Line 24 define mdbook twice via different backends. This can lead to ambiguous tool resolution or double installs. Prefer a single source of truth (either cargo:mdbook or mdbook) and remove the other.

Example (keep cargo backend)
-mdbook                      = "0.5.2"

Also applies to: 24-24

🤖 Prompt for AI Agents
In `@mise.toml` at line 10, The file defines mdbook twice; remove the duplicate
backend entry so only one key remains (either "cargo:mdbook" or "mdbook").
Locate the two keys "cargo:mdbook" and "mdbook" in the mise.toml and delete the
one you do not want to use (prefer keeping "cargo:mdbook" if you need the cargo
backend), ensuring the remaining key retains its version string.

mise.toml Outdated
Comment on lines 18 to 22
claude = "latest"
cyclonedx = "0.29.2"
git-cliff = "2.11.0"
goreleaser = "2.13.3"
just = "latest"
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find and read the mise.toml file
find . -name "mise.toml" -type f

Repository: EvilBit-Labs/Stringy

Length of output: 75


🏁 Script executed:

# Once located, read the file with line numbers
cat -n mise.toml | head -30

Repository: EvilBit-Labs/Stringy

Length of output: 1410


🏁 Script executed:

# Check for any Unicode punctuation in mise.toml
file mise.toml
hexdump -C mise.toml | grep -E "e2|e3|e4|e8|e9|ea|eb|ef" | head -20

Repository: EvilBit-Labs/Stringy

Length of output: 152


Pin explicit versions for claude and just to maintain deterministic builds

Lines 18 and 22 use "latest" for claude and just, which breaks build reproducibility across environments and time. Pin specific versions.

Example fix
-claude                      = "latest"
+claude                      = "X.Y.Z"
...
-just                        = "latest"
+just                        = "X.Y.Z"
🤖 Prompt for AI Agents
In `@mise.toml` around lines 18 - 22, The manifest uses "latest" for the claude
and just entries which breaks reproducible builds; update the claude and just
keys to fixed semantic versions (e.g., replace claude = "latest" and just =
"latest" with explicit version strings) so builds are deterministic, ensure the
chosen versions are compatible with the other listed tools (cyclonedx,
git-cliff, goreleaser), and run any dependency lock/update step your repo uses
after changing the versions to commit the resolved state.

Comment on lines 75 to 86
#[derive(Debug, Default)]
pub struct SemanticClassifier;

#[doc(hidden)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RegexCacheAddresses {
    pub guid: usize,
    pub email: usize,
    pub base64: usize,
    pub format: usize,
    pub user_agent: usize,
}
Contributor

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f "mod.rs" "src/classification"

Repository: EvilBit-Labs/Stringy

Length of output: 124


🏁 Script executed:

cat -n src/classification/mod.rs

Repository: EvilBit-Labs/Stringy

Length of output: 14179


Add #[non_exhaustive] to public structs.
Both SemanticClassifier and RegexCacheAddresses are public and must be marked #[non_exhaustive] per coding guidelines. Additionally, RegexCacheAddresses should provide an explicit constructor instead of exposing fields directly.

Proposed change
+#[non_exhaustive]
 #[derive(Debug, Default)]
 pub struct SemanticClassifier;

 #[doc(hidden)]
+#[non_exhaustive]
 #[derive(Debug, Clone, Copy, PartialEq, Eq)]
 pub struct RegexCacheAddresses {
🤖 Prompt for AI Agents
In `@src/classification/mod.rs` around lines 75 - 86, Mark both public structs
with #[non_exhaustive] to follow the guideline: add #[non_exhaustive] above
SemanticClassifier and RegexCacheAddresses. For RegexCacheAddresses, make the
fields private (remove pub on guid, email, base64, format, user_agent) and add a
public constructor pub fn new(guid: usize, email: usize, base64: usize, format:
usize, user_agent: usize) -> Self that returns Self { guid, email, base64,
format, user_agent }, keeping the existing derives; update any call sites to use
RegexCacheAddresses::new instead of struct literal field initialization.

…e deprecated semantic module

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@unclesp1d3r unclesp1d3r requested a review from Copilot January 25, 2026 18:14
@unclesp1d3r unclesp1d3r changed the title fix: implement semantic pattern matching guid email base64 printf format and user agent detection feat(classification): add semantic pattern matching for GUID, email, base64, format strings, and user agents Jan 25, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@benches/classification.rs`:
- Around line 16-22: The benchmark named bench_regex_compilation currently only
calls SemanticClassifier::new, which doesn't force lazy_static regex
compilation; either rename the benchmark to reflect it's measuring the
constructor (e.g., classification_constructor) or change its body to trigger
regex initialization explicitly (call a method that uses the regex such as
SemanticClassifier::classify or call the module's regex initializer so
lazy_static/once_cell runs) so the measured work matches the benchmark name;
apply the same change to the duplicate at lines 124-135 as well.
♻️ Duplicate comments (2)
src/classification/mod.rs (1)

75-86: Add #[non_exhaustive] and a constructor for public structs.

Both SemanticClassifier and RegexCacheAddresses are public and should be non_exhaustive. RegexCacheAddresses should also expose a constructor so fields are not the public API surface. This is required for API stability. As per coding guidelines.

Proposed fix
+#[non_exhaustive]
 #[derive(Debug, Default)]
 pub struct SemanticClassifier;

 #[doc(hidden)]
+#[non_exhaustive]
 #[derive(Debug, Clone, Copy, PartialEq, Eq)]
 pub struct RegexCacheAddresses {
-    pub guid: usize,
-    pub email: usize,
-    pub base64: usize,
-    pub format: usize,
-    pub user_agent: usize,
+    guid: usize,
+    email: usize,
+    base64: usize,
+    format: usize,
+    user_agent: usize,
 }
+
+impl RegexCacheAddresses {
+    #[must_use]
+    pub fn new(guid: usize, email: usize, base64: usize, format: usize, user_agent: usize) -> Self {
+        Self {
+            guid,
+            email,
+            base64,
+            format,
+            user_agent,
+        }
+    }
+}
     pub fn regex_cache_addresses(&self) -> RegexCacheAddresses {
-        RegexCacheAddresses {
-            guid: &*GUID_REGEX as *const Regex as usize,
-            email: &*EMAIL_REGEX as *const Regex as usize,
-            base64: &*BASE64_REGEX as *const Regex as usize,
-            format: &*FORMAT_REGEX as *const Regex as usize,
-            user_agent: &*USER_AGENT_REGEX as *const Regex as usize,
-        }
+        RegexCacheAddresses::new(
+            &*GUID_REGEX as *const Regex as usize,
+            &*EMAIL_REGEX as *const Regex as usize,
+            &*BASE64_REGEX as *const Regex as usize,
+            &*FORMAT_REGEX as *const Regex as usize,
+            &*USER_AGENT_REGEX as *const Regex as usize,
+        )
     }

Also applies to: 178-187

tests/classification_integration_tests.rs (1)

148-174: Avoid hard wall clock assertions in tests.

This can be flaky across CI runners. Prefer moving the performance check to Criterion or mark it ignored for CI. As per coding guidelines.

Proposed fix
-#[test]
-fn test_classification_performance() {
+#[test]
+#[ignore]
+fn test_classification_performance() {

Comment on lines 16 to 22
fn bench_regex_compilation(c: &mut Criterion) {
    c.bench_function("classification_regex_compilation", |b| {
        b.iter(|| {
            let _ = SemanticClassifier::new();
        });
    });
}
Contributor

🧹 Nitpick | 🔵 Trivial

Rename or rework the regex compilation benchmark.

SemanticClassifier::new does not trigger lazy_static regex compilation, so this benchmark measures a trivial constructor, not regex compilation. Rename it or trigger regex use explicitly so the measurement matches its name.

Proposed fix
-fn bench_regex_compilation(c: &mut Criterion) {
-    c.bench_function("classification_regex_compilation", |b| {
+fn bench_classifier_init(c: &mut Criterion) {
+    c.bench_function("classification_classifier_init", |b| {
         b.iter(|| {
             let _ = SemanticClassifier::new();
         });
     });
 }
 criterion_group!(
     classification_benches,
-    bench_regex_compilation,
+    bench_classifier_init,
     bench_guid_classification,
     bench_email_classification,
     bench_base64_classification,
     bench_format_string_classification,
     bench_user_agent_classification,
     bench_batch_classification,
     bench_worst_case,
     bench_context_creation
 );

Also applies to: 124-135

🤖 Prompt for AI Agents
In `@benches/classification.rs` around lines 16 - 22, The benchmark named
bench_regex_compilation currently only calls SemanticClassifier::new, which
doesn't force lazy_static regex compilation; either rename the benchmark to
reflect it's measuring the constructor (e.g., classification_constructor) or
change its body to trigger regex initialization explicitly (call a method that
uses the regex such as SemanticClassifier::classify or call the module's regex
initializer so lazy_static/once_cell runs) so the measured work matches the
benchmark name; apply the same change to the duplicate at lines 124-135 as well.

Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Comment on lines 65 to 69
 let classifier = SemanticClassifier::new();
-let found_string = FoundString {
-    text: "C:\\Windows\\System32\\cmd.exe".to_string(),
+let context = StringContext {
+    section_type: SectionType::StringData,
+    section_name: Some(".rodata".to_string()),
+    binary_format: BinaryFormat::Elf,
Copilot AI Jan 25, 2026

This example constructs StringContext with a struct literal, but StringContext is #[non_exhaustive], so external users cannot instantiate it this way. Update the example to use StringContext::new(...) plus builder methods (e.g., with_section_name) so the docs compile and match the intended construction pattern.

Comment on lines +45 to 47
 pub mod patterns;
 pub mod ranking;
-pub mod semantic;
 pub mod symbols;
Copilot AI Jan 25, 2026

pub mod semantic was removed, which breaks users importing from stringy::classification::semantic::* (this module was previously public). If you want to preserve source compatibility, consider reintroducing a small semantic module that re-exports SemanticClassifier (and any still-supported helpers), and optionally deprecate it.

…ing and regex caching

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/classification/patterns/network.rs (1)

27-57: Domain detection misses newer TLDs due to incomplete allowlist.

has_valid_tld only accepts domains in COMMON_TLDS, causing false negatives for valid but unlisted TLDs (e.g., .systems, .network, .photography). For a binary analysis tool, this reduces detection coverage for command-and-control infrastructure and other malicious domains. Consider a fallback heuristic: if a TLD is 2-24 characters and all alphabetic, treat it as valid rather than rejecting it outright.

Suggested fallback
 if let Some(dot_pos) = domain.rfind('.') {
     let tld = &domain[dot_pos + 1..];
     let tld_lower = tld.to_lowercase();
-    COMMON_TLDS.contains(tld_lower.as_str())
+    if COMMON_TLDS.contains(tld_lower.as_str()) {
+        true
+    } else {
+        (2..=24).contains(&tld.len()) && tld.chars().all(|c| c.is_ascii_alphabetic())
+    }
 } else {
     false
 }
🤖 Fix all issues with AI agents
In `@src/types/error.rs`:
- Around line 12-44: Replace the flat String-based ParseError variants and
manual From impls by adding dedicated error-typed variants on the StringyError
enum and deriving #[from] where appropriate so original error sources are
preserved: change ParseError(String) to ParseError(#[from] goblin::error::Error)
and add variants like PeliteError(#[from] pelite::Error) and
ResourceFindError(#[from] pelite::resources::FindError) (or similarly-named
variants) and remove the manual impl From<...> blocks for goblin and pelite;
keep other variants (EncodingError, ConfigError, MemoryMapError) as-is and keep
the Result<T> = std::result::Result<T, StringyError> alias.

In `@tests/classification_integration_tests.rs`:
- Around line 24-105: Update the three tests test_elf_string_classification,
test_pe_string_classification, and test_macho_string_classification to record
insta snapshots of the classifier results instead of only asserting individual
tags: after creating SemanticClassifier::new(), creating the context via
create_test_context(...), and calling classifier.classify(...), collect the
returned tag sets (for each input string like guid, email, format_string,
base64, user_agent) into a serializable structure (e.g., Vec or HashSet of Tag
names) and replace or augment the assert!(tags.contains(...)) checks with insta
snapshot assertions (e.g., insta::assert_yaml_snapshot!) to persist the expected
classification outputs for each fixture-driven test; keep at least one explicit
assert per test if desired but ensure snapshotting uses the classifier variable
and context used in each test (symbols: SemanticClassifier::new,
create_test_context, classifier.classify, and the test functions named above).

In `@tests/classification_tests.rs`:
- Around line 177-226: Extract repeated FoundString construction into a small
helper (e.g., make_found_string or build_found_string) used by
test_classify_found_string_backward_compatibility: the helper should accept
parameters used in the diffs (string content, Encoding, offset, length,
StringSource, optional section) and return a FoundString (created via
FoundString::new and .with_section when needed); replace the three inline
FoundString::new(...) constructions in the test with calls to this helper and
keep calls to classifier.classify_found_string(&found_...) unchanged to preserve
assertions for Tag::Guid, Tag::Email, and Tag::FormatString.
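
A sketch of the helper that prompt describes. FoundString::new's exact arity and the with_section name are taken from the prompt text, so treat the signature as an assumption:

```rust
// Hypothetical helper; adjust to FoundString::new's real signature.
fn make_found_string(
    text: &str,
    encoding: Encoding,
    offset: u64,
    source: StringSource,
    section: Option<&str>,
) -> FoundString {
    let found = FoundString::new(text.to_string(), encoding, offset, text.len(), source);
    match section {
        Some(name) => found.with_section(name.to_string()),
        None => found,
    }
}
```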
♻️ Duplicate comments (2)
tests/classification_integration_tests.rs (1)

148-170: Avoid hard wall-clock thresholds in unit tests.

The `elapsed < Duration::from_secs(2)` assertion is prone to CI variance. This was already flagged in a prior review; consider moving it to benchmarks, marking the test as ignored, or loosening/removing the timing check.

src/classification/mod.rs (1)

75-77: Add #[non_exhaustive] to public SemanticClassifier.

Public structs should be #[non_exhaustive] for API stability. This was already flagged in prior review. As per coding guidelines.

Proposed change
-#[derive(Debug, Default)]
+#[non_exhaustive]
+#[derive(Debug, Default)]
 pub struct SemanticClassifier;

Comment on lines +12 to +44
#[error("Binary parsing error: {0}")]
ParseError(String),

#[error("Invalid encoding in string at offset {offset}")]
EncodingError { offset: u64 },

#[error("Configuration error: {0}")]
ConfigError(String),

#[error("Memory mapping error: {0}")]
MemoryMapError(String),
}

/// Result type alias for the stringy library
pub type Result<T> = std::result::Result<T, StringyError>;

impl From<goblin::error::Error> for StringyError {
fn from(err: goblin::error::Error) -> Self {
StringyError::ParseError(err.to_string())
}
}

impl From<pelite::Error> for StringyError {
fn from(err: pelite::Error) -> Self {
StringyError::ParseError(err.to_string())
}
}

impl From<pelite::resources::FindError> for StringyError {
fn from(err: pelite::resources::FindError) -> Self {
StringyError::ParseError(format!("Resource lookup error: {}", err))
}
}

🧹 Nitpick | 🔵 Trivial

Preserve parse error sources instead of flattening to String.

Converting goblin and pelite errors to String loses the error chain and makes debugging harder. Prefer dedicated variants with #[from] to preserve sources. Based on coding guidelines.

Proposed refactor to preserve sources
 pub enum StringyError {
@@
-    #[error("Binary parsing error: {0}")]
-    ParseError(String),
+    #[error("Goblin parse error: {0}")]
+    GoblinError(#[from] goblin::error::Error),
+
+    #[error("PE parse error: {0}")]
+    PeliteError(#[from] pelite::Error),
+
+    #[error("PE resource lookup error: {0}")]
+    PeliteResourceError(#[from] pelite::resources::FindError),
@@
-impl From<goblin::error::Error> for StringyError {
-    fn from(err: goblin::error::Error) -> Self {
-        StringyError::ParseError(err.to_string())
-    }
-}
-
-impl From<pelite::Error> for StringyError {
-    fn from(err: pelite::Error) -> Self {
-        StringyError::ParseError(err.to_string())
-    }
-}
-
-impl From<pelite::resources::FindError> for StringyError {
-    fn from(err: pelite::resources::FindError) -> Self {
-        StringyError::ParseError(format!("Resource lookup error: {}", err))
-    }
-}
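
For context, a minimal compiling sketch of the `#[from]` shape (variant names follow the suggestion above and are not yet the crate's actual API):

```rust
use thiserror::Error;

// Each upstream error gets a dedicated variant; #[from] derives the
// From impl and preserves the original error as source().
#[derive(Debug, Error)]
pub enum StringyError {
    #[error("Goblin parse error: {0}")]
    GoblinError(#[from] goblin::error::Error),

    #[error("PE parse error: {0}")]
    PeliteError(#[from] pelite::Error),
}

// `?` now converts automatically, with no manual From boilerplate and
// no stringification, so callers can still walk the error chain.
fn parse_elf(data: &[u8]) -> Result<goblin::elf::Elf<'_>, StringyError> {
    Ok(goblin::elf::Elf::parse(data)?)
}
```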
🤖 Prompt for AI Agents
In `@src/types/error.rs` around lines 12 - 44, Replace the flat String-based
ParseError variants and manual From impls by adding dedicated error-typed
variants on the StringyError enum and deriving #[from] where appropriate so
original error sources are preserved: change ParseError(String) to
ParseError(#[from] goblin::error::Error) and add variants like
PeliteError(#[from] pelite::Error) and ResourceFindError(#[from]
pelite::resources::FindError) (or similarly-named variants) and remove the
manual impl From<...> blocks for goblin and pelite; keep other variants
(EncodingError, ConfigError, MemoryMapError) as-is and keep the Result<T> =
std::result::Result<T, StringyError> alias.

Comment on lines +24 to +105
#[test]
fn test_elf_string_classification() {
    let fixture_path = get_fixture_path("test_binary_elf");
    let elf_data = fs::read(&fixture_path)
        .expect("Failed to read ELF fixture. Run the build script to generate fixtures.");

    assert!(ElfParser::detect(&elf_data), "ELF detection should succeed");
    let parser = ElfParser::new();
    let container_info = parser.parse(&elf_data).expect("Failed to parse ELF");

    assert_eq!(container_info.format, BinaryFormat::Elf);

    let classifier = SemanticClassifier::new();
    let context = create_test_context(
        BinaryFormat::Elf,
        SectionType::StringData,
        StringSource::SectionData,
    );

    let guid = "{12345678-1234-1234-1234-123456789abc}";
    let tags = classifier.classify(guid, &context);
    assert!(tags.contains(&Tag::Guid));

    let email = "admin@malware.com";
    let tags = classifier.classify(email, &context);
    assert!(tags.contains(&Tag::Email));

    let format_string = "Error: %s at line %d";
    let tags = classifier.classify(format_string, &context);
    assert!(tags.contains(&Tag::FormatString));
}

#[test]
fn test_pe_string_classification() {
    let fixture_path = get_fixture_path("test_binary_pe.exe");
    let pe_data = fs::read(&fixture_path)
        .expect("Failed to read PE fixture. Run the build script to generate fixtures.");

    assert!(PeParser::detect(&pe_data), "PE detection should succeed");
    let parser = PeParser::new();
    let container_info = parser.parse(&pe_data).expect("Failed to parse PE");

    assert_eq!(container_info.format, BinaryFormat::Pe);

    let classifier = SemanticClassifier::new();
    let context = create_test_context(
        BinaryFormat::Pe,
        SectionType::Resources,
        StringSource::ResourceString,
    );

    let base64 = "U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw==";
    let tags = classifier.classify(base64, &context);
    assert!(tags.contains(&Tag::Base64));

    let user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36";
    let tags = classifier.classify(user_agent, &context);
    assert!(tags.contains(&Tag::UserAgent));
}

#[test]
fn test_macho_string_classification() {
    let fixture_path = get_fixture_path("test_binary_macho");
    let macho_data = fs::read(&fixture_path)
        .expect("Failed to read Mach-O fixture. Run the build script to generate fixtures.");

    assert!(
        MachoParser::detect(&macho_data),
        "Mach-O detection should succeed"
    );
    let parser = MachoParser::new();
    let container_info = parser.parse(&macho_data).expect("Failed to parse Mach-O");

    assert_eq!(container_info.format, BinaryFormat::MachO);

    let classifier = SemanticClassifier::new();
    let context = create_test_context(
        BinaryFormat::MachO,
        SectionType::StringData,
        StringSource::SectionData,
    );


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if the test file exists and view its structure
fd "classification_integration_tests.rs" tests/

# Count lines to understand file size
wc -l tests/classification_integration_tests.rs 2>/dev/null || echo "File not found"

# Check for any insta usage in the test file
rg "insta" tests/classification_integration_tests.rs 2>/dev/null || echo "No insta found"

# Look for existing snapshot files
fd "\.snap" tests/ 2>/dev/null | head -20 || echo "No snapshots found"

# Check the test structure more broadly
rg "#\[test\]" tests/classification_integration_tests.rs -A 2 2>/dev/null || echo "No test macros found"

Repository: EvilBit-Labs/Stringy

Length of output: 2004


🏁 Script executed:

# Let me examine the test file to understand current structure
head -50 tests/classification_integration_tests.rs

# Check if there are any imports of insta in the file
rg "use insta" tests/classification_integration_tests.rs

# Check if the snapshot file is referenced or used
rg "classification_snapshots\|assert_debug_snapshot\|assert_snapshot" tests/classification_integration_tests.rs

# Look at the full test file to understand the structure
wc -l tests/classification_integration_tests.rs
cat tests/classification_integration_tests.rs

Repository: EvilBit-Labs/Stringy

Length of output: 7547


🏁 Script executed:

# Check the existing snapshot file to understand the pattern
cat tests/snapshots/classification_integration__classification_snapshots.snap

# Verify if other test files in the repository use insta snapshots
rg "assert_debug_snapshot\|assert_snapshot" tests/ -l | head -5

# Check one of the output integration tests to see the insta usage pattern
rg "assert_.*_snapshot" tests/output_json_integration_tests.rs -B 2 -A 2 | head -30

Repository: EvilBit-Labs/Stringy

Length of output: 733


🏁 Script executed:

# Check if there's a different classification_integration.rs file
fd "classification_integration\.rs" tests/

# Look for where the snapshot is actually used
rg "classification_snapshots" tests/

# Check git history or structure to understand the relationship
git log --oneline tests/classification_integration* 2>/dev/null | head -10

# Check if insta is even in the dependencies
rg "insta" Cargo.toml tests/Cargo.toml 2>/dev/null | head -20

# Look at the output_json test to see insta usage pattern
fd "output_json_integration" tests/ -type f

Repository: EvilBit-Labs/Stringy

Length of output: 491


🏁 Script executed:

# Look at the actual classification_integration.rs file that uses snapshots
cat tests/classification_integration.rs

# Compare file sizes
wc -l tests/classification_integration.rs tests/classification_integration_tests.rs

# Check imports in classification_integration.rs
head -20 tests/classification_integration.rs

Repository: EvilBit-Labs/Stringy

Length of output: 4011


Add insta snapshots for fixture-driven tests.

Tests loading binary fixtures must use insta snapshots per coding guidelines. The three ELF/PE/Mach-O format tests (lines 24-105) load fixtures from tests/fixtures/ but verify classifications only with assert!(tags.contains(...)). Snapshot the classified tag sets per format to capture expected behavior and prevent regressions.

Example refactor for ELF test
 use std::fs;
 use std::time::{Duration, Instant};
+use insta::assert_debug_snapshot;

 use stringy::classification::SemanticClassifier;
 use stringy::container::{ContainerParser, ElfParser, MachoParser, PeParser};
 use stringy::types::{BinaryFormat, Encoding, SectionType, StringContext, StringSource, Tag};
@@
     let guid = "{12345678-1234-1234-1234-123456789abc}";
     let tags = classifier.classify(guid, &context);
     assert!(tags.contains(&Tag::Guid));
+    assert_debug_snapshot!("elf_guid_tags", tags);
     
     let email = "admin@malware.com";
     let tags = classifier.classify(email, &context);
     assert!(tags.contains(&Tag::Email));
+    assert_debug_snapshot!("elf_email_tags", tags);
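
Roughly, a standalone test in that style might look like this (it reuses the file's existing `create_test_context` helper and imports):

```rust
use insta::assert_debug_snapshot;

#[test]
fn elf_guid_tags_snapshot() {
    let classifier = SemanticClassifier::new();
    let context = create_test_context(
        BinaryFormat::Elf,
        SectionType::StringData,
        StringSource::SectionData,
    );

    let tags = classifier.classify("{12345678-1234-1234-1234-123456789abc}", &context);

    // First run records the snapshot under tests/snapshots/ (accept it
    // with `cargo insta review`); later runs fail if the tag set drifts.
    assert_debug_snapshot!("elf_guid_tags", tags);
}
```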
🤖 Prompt for AI Agents
In `@tests/classification_integration_tests.rs` around lines 24 - 105, Update the
three tests test_elf_string_classification, test_pe_string_classification, and
test_macho_string_classification to record insta snapshots of the classifier
results instead of only asserting individual tags: after creating
SemanticClassifier::new(), creating the context via create_test_context(...),
and calling classifier.classify(...), collect the returned tag sets (for each
input string like guid, email, format_string, base64, user_agent) into a
serializable structure (e.g., Vec or HashSet of Tag names) and replace or
augment the assert!(tags.contains(...)) checks with insta snapshot assertions
(e.g., insta::assert_yaml_snapshot!) to persist the expected classification
outputs for each fixture-driven test; keep at least one explicit assert per test
if desired but ensure snapshotting uses the classifier variable and context used
in each test (symbols: SemanticClassifier::new, create_test_context,
classifier.classify, and the test functions named above).

Comment on lines +177 to +226
#[test]
fn test_classify_found_string_backward_compatibility() {
    let classifier = SemanticClassifier::new();

    // Test GUID classification via FoundString
    let found_guid = FoundString::new(
        "{12345678-1234-1234-1234-123456789abc}".to_string(),
        Encoding::Ascii,
        0,
        38,
        StringSource::SectionData,
    )
    .with_section(".rodata".to_string());

    let tags = classifier.classify_found_string(&found_guid);
    assert!(
        tags.contains(&Tag::Guid),
        "GUID should be detected via classify_found_string"
    );

    // Test email classification via FoundString
    let found_email = FoundString::new(
        "admin@example.com".to_string(),
        Encoding::Ascii,
        100,
        17,
        StringSource::SectionData,
    );

    let tags = classifier.classify_found_string(&found_email);
    assert!(
        tags.contains(&Tag::Email),
        "Email should be detected via classify_found_string"
    );

    // Test format string classification via FoundString
    let found_format = FoundString::new(
        "Error: %s at line %d".to_string(),
        Encoding::Ascii,
        200,
        20,
        StringSource::SectionData,
    );

    let tags = classifier.classify_found_string(&found_format);
    assert!(
        tags.contains(&Tag::FormatString),
        "Format string should be detected via classify_found_string"
    );
}

🧹 Nitpick | 🔵 Trivial

Factor repeated FoundString construction into a helper.

Multiple tests hand-build FoundString instances. A small helper reduces duplication and keeps length/offset usage consistent.

Example helper
 use stringy::types::{
     BinaryFormat, Encoding, FoundString, SectionType, StringContext, StringSource, Tag,
 };
@@
+fn make_found_string(text: &str, offset: u64, length: u32) -> FoundString {
+    FoundString::new(
+        text.to_string(),
+        Encoding::Ascii,
+        offset,
+        length,
+        StringSource::SectionData,
+    )
+}
@@
-    let found_guid = FoundString::new(
-        "{12345678-1234-1234-1234-123456789abc}".to_string(),
-        Encoding::Ascii,
-        0,
-        38,
-        StringSource::SectionData,
-    )
-    .with_section(".rodata".to_string());
+    let found_guid = make_found_string(
+        "{12345678-1234-1234-1234-123456789abc}",
+        0,
+        38,
+    )
+    .with_section(".rodata".to_string());
🤖 Prompt for AI Agents
In `@tests/classification_tests.rs` around lines 177 - 226, Extract repeated
FoundString construction into a small helper (e.g., make_found_string or
build_found_string) used by test_classify_found_string_backward_compatibility:
the helper should accept parameters used in the diffs (string content, Encoding,
offset, length, StringSource, optional section) and return a FoundString
(created via FoundString::new and .with_section when needed); replace the three
inline FoundString::new(...) constructions in the test with calls to this helper
and keep calls to classifier.classify_found_string(&found_...) unchanged to
preserve assertions for Tag::Guid, Tag::Email, and Tag::FormatString.

unclesp1d3r and others added 2 commits January 25, 2026 15:16
…terns, and file structure

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Add missing `command: zigbuild` for cross-compilation in goreleaser
- Remove `rustup default stable` hook (toolchain pinned via rust-toolchain.toml)
- Replace lazy_static with once_cell::sync::Lazy for consistency
- Remove lazy_static dependency from Cargo.toml
- Fix documentation to use StringContext::new() constructor pattern
- Add missing semantic categories (URLs, domains, IPs, paths) to docs
- Remove flaky wall-clock timing assertion from integration tests
- Rename misleading benchmark from regex_compilation to classifier_construction
- Remove duplicate mdbook definition and pin just version in mise.toml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@AGENTS.md`:
- Around line 88-91: Replace the bold label "**New semantic tag**:" with a
proper Markdown heading (e.g., "### New semantic tag") or a list item to satisfy
MD036; keep the rest of the sentence ("Add variant to `Tag` enum in
`types/mod.rs`, implement pattern in `classification/patterns/` or
`classification/mod.rs`") unchanged and ensure the heading level aligns with
surrounding headings in AGENTS.md so references to the Tag enum (`Tag`),
types/mod.rs, and classification/patterns/ or classification/mod.rs remain
clear.

In `@benches/classification.rs`:
- Around line 84-102: In bench_batch_classification replace the mixed-format
string "Error %s at line {}" so it matches the classifier's documented patterns
(use % specifiers or numeric {digits}); update the format! call that populates
samples (inside the loop that builds samples) to use a numeric placeholder like
"Error %s at line {0}" (still passing index), so the sample vector contains the
intended %-style pattern and the classifier.classify calls
(classifier.classify(..., &context)) will see the correct mix of formats.

In `@docs/src/classification.md`:
- Around line 13-17: The inline code span for the URL regex under the "### URLs"
section contains an unescaped backtick and a stray trailing backslash which
breaks rendering; update the code span for the pattern (the string starting with
`https?://[^\s<>"{}|\\\^\[\]\`\]+\`) to use a double-backtick code span and
remove the trailing backslash so the backtick inside the character class is
preserved and the pattern renders correctly.
♻️ Duplicate comments (7)
.goreleaser.yaml (1)

11-16: Pin cargo-zigbuild in the pre-build hook.
cargo install without --version makes release builds non-deterministic and can drift from the toolchain pinned in mise.toml. Align the hook to a fixed version.

Proposed fix
-    - cargo install --locked cargo-zigbuild
+    - cargo install --locked --version 0.21.2 cargo-zigbuild
#!/bin/bash
# Confirm the pinned cargo-zigbuild version in mise.toml
rg -n '"cargo:cargo-zigbuild"' mise.toml
mise.toml (2)

9-9: Use stable cargo-nextest instead of a pre-release.
Pinning a pre-release increases instability, and the pinned version may not even be published. Prefer the stable release.

Proposed fix
-"cargo:cargo-nextest"       = "0.9.123-b.4"
+"cargo:cargo-nextest"       = "0.9.123"
What is the latest stable cargo-nextest version, and does 0.9.123-b.4 exist?

18-18: Pin claude to a fixed version for reproducible toolchains.
Using "latest" makes local and CI environments drift over time.

Proposed fix
-claude                      = "latest"
+claude                      = "X.Y.Z"
What is the latest stable version string for the claude CLI (mise tool)?
src/classification/mod.rs (2)

40-76: Use lazy_static! for regex caching in classification modules.
This module uses once_cell::sync::Lazy, but the project rules require lazy_static! for regex caches. Convert the statics and ensure lazy_static is present in Cargo.toml (remove once_cell here if no longer needed). As per coding guidelines, align classification regex caching to lazy_static.

Proposed fix
-use once_cell::sync::Lazy;
+use lazy_static::lazy_static;
 use regex::Regex;
 
-static GUID_REGEX: Lazy<Regex> = Lazy::new(|| {
-    Regex::new(r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$")
-        .expect("Invalid GUID regex")
-});
-
-static EMAIL_REGEX: Lazy<Regex> = Lazy::new(|| {
-    Regex::new(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$").expect("Invalid email regex")
-});
-
-static BASE64_REGEX: Lazy<Regex> =
-    Lazy::new(|| Regex::new(r"^[A-Za-z0-9+/]{20,}={0,2}$").expect("Invalid base64 regex"));
-
-static FORMAT_REGEX: Lazy<Regex> =
-    Lazy::new(|| Regex::new(r"%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}").expect("Invalid format regex"));
-
-static USER_AGENT_REGEX: Lazy<Regex> = Lazy::new(|| {
-    Regex::new(r"(Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+)")
-        .expect("Invalid user agent regex")
-});
+lazy_static! {
+    static ref GUID_REGEX: Regex = Regex::new(
+        r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$"
+    )
+    .expect("Invalid GUID regex");
+
+    static ref EMAIL_REGEX: Regex =
+        Regex::new(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
+            .expect("Invalid email regex");
+
+    static ref BASE64_REGEX: Regex =
+        Regex::new(r"^[A-Za-z0-9+/]{20,}={0,2}$").expect("Invalid base64 regex");
+
+    static ref FORMAT_REGEX: Regex =
+        Regex::new(r"%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}").expect("Invalid format regex");
+
+    static ref USER_AGENT_REGEX: Regex = Regex::new(
+        r"(Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+)"
+    )
+    .expect("Invalid user agent regex");
+}
#!/bin/bash
# Identify lazy_static vs once_cell usage in classification modules
rg -n "lazy_static!|once_cell::sync::Lazy" src/classification

78-80: Mark SemanticClassifier as #[non_exhaustive].
This is a public struct; add #[non_exhaustive] to keep the API forward-compatible. As per coding guidelines, public structs should be non-exhaustive.

Proposed fix
+#[non_exhaustive]
 #[derive(Debug, Default)]
 pub struct SemanticClassifier;
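
A quick sketch of what the attribute buys (the `new()` constructor mirrors the API the tests already call):

```rust
// With #[non_exhaustive], code outside the defining crate can no
// longer use the unit-struct literal `SemanticClassifier`, so the
// struct can grow fields later without a semver-breaking change.
#[non_exhaustive]
#[derive(Debug, Default)]
pub struct SemanticClassifier;

impl SemanticClassifier {
    pub fn new() -> Self {
        Self
    }
}

fn main() {
    // Downstream callers go through a constructor instead.
    let _classifier = SemanticClassifier::new();
}
```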
tests/classification_integration_tests.rs (1)

23-112: Add insta snapshots for fixture-driven classification tests.
These tests load fixtures but only assert individual tags. The test guidelines require snapshot coverage for fixture-driven outputs to prevent regressions. As per coding guidelines, add insta snapshots for the tag sets.

Proposed fix (example for ELF test)
+use insta::assert_debug_snapshot;
@@
     let guid = "{12345678-1234-1234-1234-123456789abc}";
     let tags = classifier.classify(guid, &context);
     assert!(tags.contains(&Tag::Guid));
+    assert_debug_snapshot!("elf_guid_tags", tags);
docs/src/classification.md (1)

124-133: Fix integration test path reference.

The integration test file is tests/classification_integration.rs, not tests/classification_integration_tests.rs.

Proposed fix
-- Integration tests: `tests/classification_integration_tests.rs`
+- Integration tests: `tests/classification_integration.rs`

Comment on lines 88 to 91
## Adding Features

**New semantic tag**: Add variant to `Tag` enum in `types.rs`, implement pattern in `classification/semantic.rs`
**New semantic tag**: Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`


⚠️ Potential issue | 🟡 Minor

Fix MD036 by using a heading instead of bold text.
Markdownlint reports emphasis used as a heading. Convert the bold label to a heading or list item.

Proposed fix
-**New semantic tag**: Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
+### New semantic tag
+Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## Adding Features
**New semantic tag**: Add variant to `Tag` enum in `types.rs`, implement pattern in `classification/semantic.rs`
**New semantic tag**: Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
## Adding Features
### New semantic tag
Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
🤖 Prompt for AI Agents
In `@AGENTS.md` around lines 88 - 91, Replace the bold label "**New semantic
tag**:" with a proper Markdown heading (e.g., "### New semantic tag") or a list
item to satisfy MD036; keep the rest of the sentence ("Add variant to `Tag` enum
in `types/mod.rs`, implement pattern in `classification/patterns/` or
`classification/mod.rs`") unchanged and ensure the heading level aligns with
surrounding headings in AGENTS.md so references to the Tag enum (`Tag`),
types/mod.rs, and classification/patterns/ or classification/mod.rs remain
clear.

Comment on lines +84 to +102
fn bench_batch_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();

    let mut samples = Vec::new();
    for index in 0..1000 {
        samples.push(format!("{{12345678-1234-1234-1234-{:012x}}}", index));
        samples.push(format!("user{}@example.com", index));
        samples.push(format!("Error %s at line {}", index));
    }

    c.bench_function("classification_batch", |b| {
        b.iter(|| {
            for sample in &samples {
                let _ = classifier.classify(black_box(sample.as_str()), &context);
            }
        });
    });
}

⚠️ Potential issue | 🟡 Minor

Batch sample misses the intended brace-style format-string case.

In `format!("Error %s at line {}", index)` the `{}` placeholder is consumed by `format!` itself, and a literal `{}` would not match the documented brace pattern (`{digits}`) anyway, so only the `%s` specifier matches. This reduces the intended mix of categories in the batch set.

Proposed fix
-        samples.push(format!("Error %s at line {}", index));
+        samples.push(format!("Error %s at line %d {}", index));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fn bench_batch_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();

    let mut samples = Vec::new();
    for index in 0..1000 {
        samples.push(format!("{{12345678-1234-1234-1234-{:012x}}}", index));
        samples.push(format!("user{}@example.com", index));
        samples.push(format!("Error %s at line {}", index));
    }

    c.bench_function("classification_batch", |b| {
        b.iter(|| {
            for sample in &samples {
                let _ = classifier.classify(black_box(sample.as_str()), &context);
            }
        });
    });
}

fn bench_batch_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();

    let mut samples = Vec::new();
    for index in 0..1000 {
        samples.push(format!("{{12345678-1234-1234-1234-{:012x}}}", index));
        samples.push(format!("user{}@example.com", index));
        samples.push(format!("Error %s at line %d {}", index));
    }

    c.bench_function("classification_batch", |b| {
        b.iter(|| {
            for sample in &samples {
                let _ = classifier.classify(black_box(sample.as_str()), &context);
            }
        });
    });
}
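
A quick check against the documented pattern (regex copied from `src/classification/mod.rs`) shows which placeholders actually match:

```rust
use regex::Regex;

fn main() {
    let format_re = Regex::new(r"%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}").unwrap();

    // The %s specifier alone already matches the batch sample...
    assert!(format_re.is_match("Error %s at line 42"));
    // ...but the brace branch only accepts numbered placeholders,
    // so a literal `{}` never exercises it while `{0}` does.
    assert!(!format_re.is_match("at line {}"));
    assert!(format_re.is_match("at line {0}"));
}
```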
🤖 Prompt for AI Agents
In `@benches/classification.rs` around lines 84 - 102, In
bench_batch_classification replace the mixed-format string "Error %s at line {}"
so it matches the classifier's documented patterns (use % specifiers or numeric
{digits}); update the format! call that populates samples (inside the loop that
builds samples) to use a numeric placeholder like "Error %s at line {0}" (still
passing index), so the sample vector contains the intended %-style pattern and
the classifier.classify calls (classifier.classify(..., &context)) will see the
correct mix of formats.

Comment on lines +13 to +17
### URLs

#### URLs
- Pattern: `https?://[^\s<>"{}|\\\^\[\]\`\]+\`
- Examples: `https://example.com/path`, `http://malware.site/payload`
- Validation: Must start with `http://` or `https://`

⚠️ Potential issue | 🟡 Minor

URL pattern code span breaks on backtick.

The inline code span contains a backtick in the character class, which terminates the code span and renders incorrectly. Use a double-backtick code span and remove the stray trailing backslash.

Proposed fix
-- Pattern: `https?://[^\s<>"{}|\\\^\[\]\`\]+\`
+- Pattern: ``https?://[^\s<>"{}|\\\^\[\]\`]+``
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
### URLs
#### URLs
- Pattern: `https?://[^\s<>"{}|\\\^\[\]\`\]+\`
- Examples: `https://example.com/path`, `http://malware.site/payload`
- Validation: Must start with `http://` or `https://`
### URLs
- Pattern: ``https?://[^\s<>"{}|\\\^\[\]\`]+``
- Examples: `https://example.com/path`, `http://malware.site/payload`
- Validation: Must start with `http://` or `https://`
🤖 Prompt for AI Agents
In `@docs/src/classification.md` around lines 13 - 17, The inline code span for
the URL regex under the "### URLs" section contains an unescaped backtick and a
stray trailing backslash which breaks rendering; update the code span for the
pattern (the string starting with `https?://[^\s<>"{}|\\\^\[\]\`\]+\`) to use a
double-backtick code span and remove the trailing backslash so the backtick
inside the character class is preserved and the pattern renders correctly.

@unclesp1d3r unclesp1d3r merged commit e118ffa into main Jan 26, 2026
22 checks passed
@unclesp1d3r unclesp1d3r deleted the 18-implement-semantic-pattern-matching-guid-email-base64-printf-format-and-user-agent-detection branch January 26, 2026 04:22

Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)


Development

Successfully merging this pull request may close these issues.

Implement Semantic Pattern Matching: GUID, Email, Base64, Printf Format, and User Agent Detection
