feat: Implement enum-based output formatters (Table, JSON, YARA) #127
Conversation
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Adds a human-readable table output formatter for extracted strings, with automatic TTY detection. In TTY mode, outputs an aligned table with headers and truncated columns; in non-TTY mode, outputs plain string text, one per line. Includes dynamic column width calculation, tag formatting, Unicode-safe truncation, and comprehensive integration and unit tests with snapshot verification.
Added full implementations for `format_json` and `format_yara` functions, including serialization logic, escaping, and tag/group handling. Updated documentation and examples to reflect new UTF-16 extraction API and output formatting. Added comprehensive integration and snapshot tests for both output formats to ensure correctness and stability.
- Introduced an optional `generated_at` field in `OutputMetadata` for deterministic outputs. - Updated YARA formatter to utilize the `generated_at` timestamp instead of runtime timestamps. - Adjusted integration tests and snapshots to reflect changes in generated timestamps. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
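For orientation, a hypothetical sketch of the metadata flow these commits describe. The four-argument `OutputMetadata::new` signature and the `with_generated_at` builder are taken from excerpts later in this thread; the argument types and values here are assumptions for illustration only.

```rust
use stringy::{OutputFormat, OutputMetadata};

fn example_metadata() -> OutputMetadata {
    // A fixed timestamp makes the YARA output deterministic (snapshot-friendly),
    // instead of embedding the wall clock at generation time.
    OutputMetadata::new(
        "sample.exe".to_string(), // binary name
        OutputFormat::Yara,       // requested formatter
        120,                      // total strings extracted (illustrative)
        50,                       // strings kept after filtering (illustrative)
    )
    .with_generated_at("2024-01-01T00:00:00Z".to_string())
}
```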
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Architectural Overview

This PR introduces a dedicated, extensible output-formatting subsystem and documentation/tests that separate presentation from extraction/classification and enable pluggable formatters (Table, JSONL, YARA) dispatched by a central API.

Key Architectural Changes
Public API & Type Changes
Deduplication & Internal Data Handling
Binary Format & Extraction Impact
Semantic Classification & Preservation
Testing, Documentation & Developer Experience
Notable Behavioral Constraints & Validations
Overall Impact

This PR substantially strengthens Stringy's presentation layer by adding a robust, extensible formatting subsystem that preserves extraction/classification semantics, clarifies consumer ergonomics (FoundString::new, crate-level re-exports), enables deterministic outputs via metadata, and increases confidence through comprehensive tests and updated documentation.

Walkthrough

Adds a new output formatting subsystem (JSONL, table, YARA) with metadata-driven dispatch, updates doc examples to use FoundString::new(...), refactors dedup grouping to use encoding-typed keys, and introduces examples, integration tests, tooling, and documentation updates.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Client
    participant Dispatcher as format_output(strings, metadata)
    participant Meta as OutputMetadata
    participant Formatter as Formatter (JSON/Table/YARA)
    participant Strings as FoundString[]
    Client->>Dispatcher: format_output(strings, metadata)
    Dispatcher->>Meta: inspect metadata.format
    alt format == Json
        Dispatcher->>Formatter: format_json(strings, metadata)
    else format == Table
        Dispatcher->>Formatter: format_table(strings, metadata)
    else format == Yara
        Dispatcher->>Formatter: format_yara(strings, metadata)
    end
    Formatter->>Strings: process encodings, tags, scores
    Formatter-->>Dispatcher: Result<String>
    Dispatcher-->>Client: formatted output
```
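Rendered as Rust, the dispatch above has roughly this shape (a sketch, not the actual src/output/mod.rs source; the function and variant names come from the diagram, and the `Result<String>` return is assumed to use the crate's error type):

```rust
// Enum-based dispatch: metadata.format selects the formatter.
fn format_output(strings: &[FoundString], metadata: &OutputMetadata) -> Result<String> {
    match metadata.format {
        OutputFormat::Json => format_json(strings, metadata),
        OutputFormat::Table => format_table(strings, metadata),
        OutputFormat::Yara => format_yara(strings, metadata),
        // OutputFormat is #[non_exhaustive], so matches outside this crate
        // would also need a catch-all arm.
    }
}
```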
```mermaid
sequenceDiagram
    actor Caller
    participant YaraFmt as format_yara()
    participant Grouper as Tag Grouper
    participant Escaper as Escaping Utils
    participant Builder as Rule Builder
    Caller->>YaraFmt: strings, metadata
    YaraFmt->>Grouper: group strings by tag/encoding/length
    Grouper-->>YaraFmt: HashMap<Tag, Vec<FoundString>>
    loop per tag group
        YaraFmt->>Escaper: escape / utf16 hex / sanitize names
        Escaper-->>YaraFmt: escaped literals
        YaraFmt->>Builder: append string var + comment + score
    end
    YaraFmt->>Builder: add meta + condition
    Builder-->>Caller: completed YARA rule
```
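The grouping step can be pictured with this minimal sketch. It keys on the first tag only and assumes `Tag` implements `Hash`/`Eq`; per the diagram, the real grouper also folds encoding and length into the key.

```rust
use std::collections::HashMap;

// Bucket strings by primary tag before emitting YARA string variables.
fn group_by_tag(strings: &[FoundString]) -> HashMap<Option<Tag>, Vec<&FoundString>> {
    let mut groups: HashMap<Option<Tag>, Vec<&FoundString>> = HashMap::new();
    for s in strings {
        groups.entry(s.tags.first().cloned()).or_default().push(s);
    }
    groups
}
```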
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 6 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
✅ Passed checks (6 passed)
Pull request overview
This pull request implements comprehensive output formatting capabilities for Stringy, adding JSON, YARA, and Table formatters with extensive test coverage. The PR also updates documentation examples across multiple modules to use the FoundString::new constructor pattern for improved consistency and clarity.
Changes:
- Implements three output formatters (JSON/JSONL, YARA rule templates, and human-readable tables) with TTY detection and format-specific features
- Adds comprehensive integration tests with snapshot testing for all three formatters covering edge cases, encoding variations, and special characters
- Updates documentation examples across classification and extraction modules to consistently use constructor patterns instead of manual struct initialization
Reviewed changes
Copilot reviewed 64 out of 64 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/output/mod.rs | Core output dispatcher with OutputFormat enum and OutputMetadata struct |
| src/output/json.rs | JSONL formatter with proper escaping and optional field handling |
| src/output/table.rs | Table formatter with TTY detection, column alignment, and truncation |
| src/output/yara.rs | YARA rule generator with string grouping, sanitization, and encoding modifiers |
| tests/output_json_integration.rs | Comprehensive JSON formatter tests (266 lines) |
| tests/output_table_integration.rs | Comprehensive table formatter tests (396 lines) |
| tests/output_yara_integration.rs | Comprehensive YARA formatter tests (197 lines) |
| tests/snapshots/* | 37 new snapshot files for output format verification |
| src/lib.rs | Re-exports output types and updates UTF-16 function name references |
| src/extraction/* | Updates documentation examples and function signatures |
| src/classification/* | Updates documentation to use FoundString::new constructor |
Codecov Report

❌ Patch coverage is
Added 'superpowers@claude-plugins-official' to the enabledPlugins list in .claude/settings.json.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/extraction/pe_resources.rs (1)
1-1449: File exceeds 500-600 line limit.

At 1449 lines, this file is nearly 3x the recommended maximum. Per coding guidelines, consider splitting into focused modules:

- pe_resources/mod.rs - public API and orchestration (`extract_resources`, `extract_resource_strings`)
- pe_resources/version_info.rs - VERSIONINFO detection and extraction
- pe_resources/string_table.rs - STRINGTABLE parsing logic
- pe_resources/manifest.rs - MANIFEST detection and encoding handling
- pe_resources/tests.rs or tests/pe_resources.rs - test cases (836+ lines of tests alone)
🤖 Fix all issues with AI agents
In `@src/output/json.rs`:
- Around line 6-9: In format_json, remove the misleading underscore from the
used parameter (_strings) so the signature becomes format_json(strings:
&[FoundString], metadata: &OutputMetadata) and update all references to use
strings; for the currently unused metadata parameter decide one of two fixes: if
metadata should be used, incorporate it into the output (for example add a JSON
header or comment with metadata fields in the JSONL output) and reference it in
the function body, or if metadata is intentionally unused remove the parameter
entirely from format_json (or rename to _metadata to explicitly signal
intentional unusedness); update any callers of format_json accordingly.
In `@src/output/table.rs`:
- Around line 371-708: The file-level test module (#[cfg(test)] mod tests { ...
}) makes src/output/table.rs exceed the size limit; move this entire test module
into a separate test file (either create src/output/table/tests.rs and replace
the inline module with #[cfg(test)] mod tests; or create an integration test at
tests/output_table.rs). Copy all helper functions (make_test_string,
make_metadata) and all nested test modules (format_tags_tests,
truncate_string_tests, pad_string_tests, format_table_tests, column_width_tests)
into the new file, update imports so the tests bring the table module symbols
(e.g., format_tags, tag_to_display_string, truncate_string, pad_string,
format_table_with_mode, calculate_section_width, calculate_tags_width,
FoundString, Tag, OutputMetadata, Alignment, SECTION_COLUMN_WIDTH) into scope
via use crate::output::table::* or the appropriate paths, and ensure the
#[cfg(test)] gating remains in the new test file; finally remove the large
inline tests block from src/output/table.rs so the file falls under the size
limit.
- Around line 218-244: The docs promise limiting tags by MAX_TAGS_DISPLAY but
format_tags currently picks all tags with the highest boost then truncates by
width; change format_tags to instead sort tags by tag_boost_value (desc) and
take at most MAX_TAGS_DISPLAY items (use tag_to_display_string for conversion),
then join and only then truncate to TAGS_COLUMN_WIDTH if needed; keep references
to format_tags, tag_boost_value, tag_to_display_string, TAGS_COLUMN_WIDTH and
MAX_TAGS_DISPLAY so the selection matches the documented cap.
- Around line 36-38: The file imports RankingConfig from
crate::classification::ranking; change the import in src/output/table.rs to use
the crate re-export (use crate::RankingConfig;) and, if the re-export doesn't
exist, add a public re-export (pub use
crate::classification::ranking::RankingConfig;) in the crate root or the module
where re-exports are maintained so other modules follow the import policy;
update any references to RankingConfig accordingly.
- Around line 101-160: The TTY table rows can be broken by control chars; update
format_table_tty to sanitize embedded newlines/carriage-returns/tabs and pipe
characters before truncation and padding: add/ use a helper (e.g.,
sanitize_for_tty) and apply it to found_string.text, the tags_display returned
by format_tags, and section_display (the values used in truncate_string and
pad_string) so you replace "\n", "\r", "\t" with visible escapes (or safe
single-char placeholders) and escape '|' (e.g., "\\|") prior to calling
truncate_string, pad_string, and formatting the score text.
- Around line 310-342: truncate_string can produce output longer than max_len
when the first character is multi-byte because end_index may advance past
truncate_at; update truncate_string to ensure the final byte length (including
the "..." suffix) never exceeds max_len by using char_indices to find the
greatest byte boundary <= truncate_at (where truncate_at = max_len - 3) and if
no such boundary exists return "." repeated max_len (or handle very small
max_len as before); adjust the end_index logic to only set end_index to an idx
when idx <= truncate_at and avoid the fallback that advances to the next char
boundary beyond truncate_at so format!("{}...", &s[..end_index]) is guaranteed
to be <= max_len.
In `@src/output/yara.rs`:
- Around line 8-14: The function signature for format_yara uses unused-parameter
prefixes but both parameters are used; rename parameters from _strings and
_metadata to strings and metadata (update the signature in fn
format_yara(strings: &[FoundString], metadata: &OutputMetadata) ->
Result<String>), and update all references inside the function (e.g.,
metadata.generated_at, metadata.binary_name, and any use of strings) to remove
the underscores; keep all existing logic that computes timestamp,
base_rule_name, and rule_name unchanged aside from the parameter name updates.
In `@tests/output_json_integration.rs`:
- Around line 229-233: The test test_json_unicode_content is constructing a
Unicode fixture with an ASCII encoding; update the fixture to use Encoding::Utf8
so the non-ASCII string produced by make_string(...) is encoded correctly.
Locate the place where the test calls make_string or builds the fixture (within
test_json_unicode_content) and replace the Encoding::Ascii value with
Encoding::Utf8 (or ensure make_string/make_metadata receives Utf8) so the test
data matches the Unicode content invariants.
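For the table.rs truncation prompt above, here is a standalone sketch of boundary-safe truncation that satisfies the stated invariant (the in-tree function may differ in naming and in how it handles the documented small-max_len cases):

```rust
/// Truncate so the result, including the "..." suffix, never exceeds
/// max_len bytes, even when the cut falls inside a multi-byte character.
fn truncate_string(s: &str, max_len: usize) -> String {
    if s.len() <= max_len {
        return s.to_string();
    }
    if max_len <= 3 {
        return ".".repeat(max_len);
    }
    let budget = max_len - 3;
    // Largest char boundary at or below the byte budget (0 if even the
    // first character does not fit, yielding just "...").
    let end_index = s
        .char_indices()
        .map(|(i, _)| i)
        .take_while(|&i| i <= budget)
        .last()
        .unwrap_or(0);
    format!("{}...", &s[..end_index])
}
```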
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.claude/settings.json:
- Around line 2-5: The file .claude/settings.json (containing the enabledPlugins
object and keys like "commit@cc-marketplace" and
"superpowers@claude-plugins-official") appears to be a developer-local Claude AI
config and should not be committed; either remove it from version control by
adding the .claude/ directory to .gitignore and committing that change, or if
this is an intentional team-wide setting document its purpose and usage in
CONTRIBUTING.md and keep the file committed; update the repository accordingly
and remove the file from the commit history if necessary.
Address issues identified during PR review:

- Rename misleading underscore-prefixed parameters (_strings, _metadata) to strings, metadata since they are actively used
- Return "CLOCK_ERROR" instead of "0" on timestamp failure to make errors clearly distinguishable from valid timestamps
- Remove dead code: get_yara_modifier() function and its tests after inlining "ascii" constant for ASCII/UTF-8 encodings
- Simplify utf16be_hex_string() using iterator chains with flat_map()
- Add Tag import and simplify tag_name() function

Add comprehensive unit tests for UTF-16 encoding functions:

- escape_yara_unicode_literal: basic escapes, control chars, unicode passthrough, empty string
- utf16be_hex_string: basic ASCII, empty string, non-ASCII BMP chars, surrogate pairs
- escape_yara_string: additional control character coverage
- with_generated_at builder method validation
- Default timestamp fallback behavior

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
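The "CLOCK_ERROR" fallback mentioned in the commit above might look like the following (a sketch; the in-tree function's actual name is not shown in this thread):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// A clock failure yields a clearly invalid marker rather than a
// valid-looking "0" timestamp.
fn unix_timestamp_or_marker() -> String {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs().to_string())
        .unwrap_or_else(|_| "CLOCK_ERROR".to_string())
}
```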
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/output/yara.rs`:
- Around line 61-69: The Utf16Le branch in the match (handling
Encoding::Utf16Le) incorrectly emits escaped bytes plus the YARA "wide" modifier
which fails for non-ASCII code points; add a new helper utf16le_hex_string that
encodes item.text as UTF-16LE bytes (use char.encode_utf16() and
to_le_bytes/collect into hex like utf16be_hex_string does) and change the
Encoding::Utf16Le arm to check item.text.is_ascii(): if ascii, keep the existing
escape_yara_unicode_literal + " wide" emission, otherwise call
utf16le_hex_string and push the hex form (strings_block.push_str with var_name
and hex) so non-ASCII UTF-16LE strings are emitted as correct hex sequences.
- Around line 16-29: The binary_name and timestamp are inserted raw into both
the comment headers and the meta field generated_at, allowing injection via
newlines; wrap both usages with escape_yara_string. Specifically, when building
the comment lines (currently format!("// Binary: {}\n", metadata.binary_name)
and format!("// Generated: {}\n\n", timestamp)) call
escape_yara_string(&metadata.binary_name) and escape_yara_string(&timestamp)
respectively, and likewise replace the generated_at meta interpolation
(format!(" generated_at = \"{}\"\n", timestamp)) with
escape_yara_string(×tamp) so both comments and meta fields are escaped
consistently (description already uses escape_yara_string).
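A self-contained sketch of the `utf16le_hex_string` helper proposed above, mirroring the described `utf16be_hex_string` but with little-endian byte order (the exact brace and spacing conventions in src/output/yara.rs are assumptions):

```rust
/// Encode text as UTF-16LE code units and render a YARA hex string,
/// e.g. "Hi" -> { 48 00 69 00 }. Surrogate pairs become four bytes.
fn utf16le_hex_string(text: &str) -> String {
    let bytes: Vec<String> = text
        .encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .map(|b| format!("{b:02X}"))
        .collect();
    format!("{{ {} }}", bytes.join(" "))
}
```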
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.coderabbit.yml (1)
352-358: Clarify the ASCII-only exception so it is enforceable.

Line 354 introduces an exception ("unless explicitly required"), but bullets 2-4 still prohibit all non-ASCII unconditionally, making the exception impossible and risking inconsistent enforcement. Align the bullets with the exception or remove it.
Proposed alignment
```diff
-  Verify that no Unicode punctuation is introduced unless explicitly required:
-    1. No emojis in code or documentation
-    2. No em-dashes - use regular hyphens
-    3. No smart quotes - use straight quotes
-    4. No other non-ASCII characters in strings or comments
+  Verify that no Unicode punctuation is introduced unless explicitly required:
+    1. No emojis in code or documentation unless explicitly required
+    2. No em-dashes - use regular hyphens unless explicitly required
+    3. No smart quotes - use straight quotes unless explicitly required
+    4. No other non-ASCII characters in strings or comments unless explicitly required (document rationale)
```
🤖 Fix all issues with AI agents
In `@AGENTS.md`:
- Line 9: The rule text under "ASCII only" contains typos: change "explicity" to
"explicitly" and "emjois" to "emojis" so the guideline reads clearly (e.g., "No
emojis, em-dashes, smart quotes, or Unicode punctuation (except when explicitly
testing or working with Unicode strings or emojis)"). Update the quoted rule
text "ASCII only" accordingly to preserve punctuation and intent.
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Immediate fixes:
- Fix failing doctests in extraction/mod.rs by wrapping in fn main()
- Fix rustdoc warning in patterns/ip.rs by escaping [::1]:8080
- Fix O(n^2) algorithms in dedup.rs using HashSet for unique detection
New features:
- Add OutputFormatter trait for extensible output formatting
- Add #[non_exhaustive] to OutputFormat enum for API stability
- Add Hash derive to Encoding and StringSource enums
Documentation:
- Create CHANGELOG.md following Keep a Changelog format
- Create ROADMAP.md documenting medium/long-term improvements
- Create examples/ directory with basic_extraction, network_indicators,
and output_formats examples
The O(n^2) fixes include:
- unique_sections detection now uses HashSet instead of Vec::contains
- unique_sources detection now uses HashSet instead of Vec::contains
- merge_tags now uses HashSet for deduplication
- Grouping key simplified from format!("{:?}", encoding) to Encoding
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
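For illustration, the Vec::contains-to-HashSet change described above follows this shape (names here are placeholders, not the actual dedup.rs internals):

```rust
use std::collections::HashSet;

// O(n) order-preserving unique detection; the previous approach called
// Vec::contains inside the loop, which is O(n^2) overall.
fn unique_values(values: &[String]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    values
        .iter()
        .filter(|v| seen.insert(v.as_str()))
        .cloned()
        .collect()
}
```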
chore(dependabot): update schedules to weekly for various ecosystems

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/classification/patterns/ip.rs (1)
117-125: Consider using an if-else expression for immutability.

The mutable `ip_text` variable is only assigned once conditionally. A more idiomatic approach would eliminate the `mut` binding:

♻️ Proposed refactor
```diff
-    let mut ip_text = text;
-
-    // Check for bracketed format
-    if text.starts_with('[') {
-        // Strip port from the full text first (e.g., [::1]:8080 -> [::1])
-        let without_port = strip_port(text);
-        // Now extract the IPv6 from brackets
-        ip_text = strip_ipv6_brackets(without_port);
-    }
+    // Check for bracketed format
+    let ip_text = if text.starts_with('[') {
+        // Strip port from the full text first (e.g., [::1]:8080 -> [::1])
+        let without_port = strip_port(text);
+        // Now extract the IPv6 from brackets
+        strip_ipv6_brackets(without_port)
+    } else {
+        text
+    };
```

src/extraction/mod.rs (1)
1-978: File exceeds 500-600 line limit.

At 1548 lines, this file significantly exceeds the coding guideline limit of 500-600 lines. Per guidelines: "Keep Rust source files <=500-600 lines; split larger files into focused modules."
Consider splitting into:
- mod.rs - trait definitions, re-exports, config (~200 lines)
- basic.rs - `BasicExtractor` implementation (~400 lines)
- helpers.rs - `extract_ascii_utf8_strings`, utility functions (~100 lines)

This would improve maintainability and align with project conventions.
src/lib.rs (2)
57-63: Remove non-ASCII status markers from crate docs

The status bullets contain emojis, which violate the ASCII-only rule for Rust and docs. As per coding guidelines.
Proposed change
```diff
-//! - [`container`]: Binary format detection and parsing (✅ Complete)
-//! - [`extraction`]: String extraction algorithms (✅ ASCII, UTF-16LE extraction and PE resources complete)
+//! - [`container`]: Binary format detection and parsing (complete)
+//! - [`extraction`]: String extraction algorithms (ASCII, UTF-16LE extraction and PE resources complete)
 //! - ASCII extraction provides foundational encoding extraction as the reference implementation
 //! - UTF-16LE extraction provides Windows PE binary string extraction with confidence scoring
-//! - [`classification`]: Semantic analysis and tagging (🚧 Types defined)
-//! - [`output`]: Result formatting (🚧 Interfaces ready)
-//! - [`types`]: Core data structures and error handling (✅ Complete)
+//! - [`classification`]: Semantic analysis and tagging (types defined)
+//! - [`output`]: Result formatting (interfaces ready)
+//! - [`types`]: Core data structures and error handling (complete)
```
1-5: Add crate-level forbid(unsafe_code) and deny(warnings) attributes, and remove emojis from documentation

The crate root is missing `#![forbid(unsafe_code)]` and `#![deny(warnings)]` attributes required by project policy. Additionally, documentation contains non-ASCII emoji characters (✅ and 🚧 on lines 57-62) which violate the ASCII-only guideline.

Proposed changes

```diff
+#![forbid(unsafe_code)]
+#![deny(warnings)]
+
 //! Stringy - A smarter alternative to the strings command
 //!
 //! Stringy leverages format-specific knowledge to distinguish meaningful strings
 //! from random garbage data in binary files.
```

Remove emoji status markers from lines 57-62:

- Replace `(✅ Complete)` with `(Complete)`
- Replace `(🚧 Types defined)` with `(In Progress - Types defined)`
- Replace `(🚧 Interfaces ready)` with `(In Progress - Interfaces ready)`

src/extraction/dedup.rs (1)
120-131: Do not drop confidence and bonuses for below-threshold groups

Below-threshold groups currently return only the max original score, which removes the confidence boost and conflicts with the documented scoring formula, making scores incomparable across thresholds. If the threshold is meant to suppress bonuses, at least retain confidence (or apply the full formula consistently). As per coding guidelines.
Proposed change
```diff
-    let combined_score = if meets_threshold {
-        calculate_combined_score(&occurrences)
-    } else {
-        // For groups below threshold, use the maximum original score without bonuses
-        occurrences
-            .iter()
-            .map(|occ| occ.original_score)
-            .max()
-            .unwrap_or(0)
-    };
+    let combined_score = if meets_threshold {
+        calculate_combined_score(&occurrences)
+    } else {
+        // Keep confidence on the same scale even below the threshold
+        let base_score = occurrences
+            .iter()
+            .map(|occ| occ.original_score)
+            .max()
+            .unwrap_or(0);
+        let max_confidence = occurrences
+            .iter()
+            .map(|occ| occ.confidence)
+            .fold(0.0f32, f32::max);
+        base_score + (max_confidence * 10.0) as i32
+    };
```
🤖 Fix all issues with AI agents
In `@CHANGELOG.md`:
- Around line 27-30: Update the "Changed" section in CHANGELOG.md to explicitly
document the breaking API rename: note that extract_utf16le_strings was renamed
to extract_utf16_strings (include both names and state it's a breaking change
for callers), add a short upgrade note instructing users to replace calls to
extract_utf16le_strings with extract_utf16_strings, and ensure this entry is
listed alongside the other "Changed" items so consumers see the migration step
when upgrading.
In `@CONTRIBUTING.md`:
- Around line 58-68: In CONTRIBUTING.md under the "## Tests" section remove the
duplicate insta snapshot guidance by keeping one concise sentence (e.g., "Use
insta snapshots for output verification when appropriate or when changing output
formatters") and delete the redundant line; also update the empty/unspecified
code fence under the "Run:" block to include a language specifier (e.g.,
```bash) so the `just test` snippet is properly highlighted and the file no
longer contains an empty/unspecified code block.
In `@examples/basic_extraction.rs`:
- Around line 53-71: The current truncation uses byte slicing
(&string.text[..47]) which can panic on multi-byte UTF-8; change to count
characters with string.text.chars().count() for the length check and perform
truncation with a character-aware approach like
string.text.chars().take(47).collect::<String>() when building the displayed
text (keep the else branch using string.text.clone()); update the condition and
the truncated branch in the loop over sorted_strings to use these char-aware
operations to avoid UTF-8 boundary panics.
In `@examples/network_indicators.rs`:
- Around line 8-12: Add re-exports for the container utilities in lib.rs and
update examples to import from the crate root: add a pub use
container::{create_parser, detect_format}; to lib.rs so create_parser and
detect_format are exported alongside extraction and types, then change the
example files (examples/network_indicators.rs, basic_extraction.rs,
output_formats.rs) to import create_parser and detect_format from stringy::
(crate root) instead of stringy::container to comply with the import guideline.
In `@examples/output_formats.rs`:
- Around line 48-65: Capture the full count of extracted strings before
truncation and pass it as total_strings to OutputMetadata::new: compute a
variable (e.g., full_count = sorted_strings.len()) immediately after sorting
(before calling .into_iter().take(50)), then use full_count for the
total_strings argument and top_strings.len() for filtered_strings when
constructing OutputMetadata (referencing sorted_strings, top_strings,
OutputMetadata::new, binary_name, output_format).
- Around line 10-15: Update the top-level imports in examples/output_formats.rs
to use crate-root re-exports for BasicExtractor, ExtractionConfig,
StringExtractor, OutputFormat, OutputMetadata, and format_output instead of
importing them from deep modules; keep create_parser and detect_format imported
from stringy::container. Concretely, replace
stringy::extraction::{BasicExtractor, ExtractionConfig, StringExtractor} and
stringy::output::{OutputFormat, OutputMetadata, format_output} with a single
import of those symbols from the crate root while leaving use
stringy::container::{create_parser, detect_format} unchanged.
In `@ROADMAP.md`:
- Around line 9-33: Update ROADMAP.md to use current, accurate line counts and
replace any non-ASCII punctuation: recalc the lines for src/extraction/mod.rs
(1548) and src/extraction/dedup.rs (838) and regenerate the table rows for all
listed files so the "Lines" and "Overage" columns match the repository's current
counts; scan and replace any em-dashes, smart quotes or other non-ASCII
characters with regular hyphens and straight quotes throughout the file to
ensure ASCII-only documentation (verify entries referencing
src/extraction/config.rs, src/extraction/trait.rs, src/extraction/basic.rs,
src/extraction/helpers.rs and the listed oversized files are updated).
- Around line 93-107: Update the roadmap section header line numbers from 93–107
to 143–165 and add a short safety note next to the memmap2 example documenting
the unsafe invariants: explain why Mmap::map is wrapped in unsafe, state that
the mapped file must not be truncated or modified for the lifetime of the Mmap,
that any mutable access to the underlying file must be avoided while the mapping
exists, and recommend validating file size/permissions before mapping; reference
the memmap2::Mmap type and the unsafe block (unsafe { Mmap::map(&file)? }) so
readers know exactly where the safety requirements apply.
```markdown
### Changed
- Repository renamed from StringyMcStringFace to Stringy
- Improved YARA formatter code quality and test coverage
- Clarified ASCII rule for Unicode handling in documentation
```
Document the extract_utf16le_strings to extract_utf16_strings API rename.
The UTF-16 extraction function was renamed in this PR. This is a breaking change that should be explicitly documented under "Changed" for users upgrading.
Proposed addition

```diff
 ### Changed
 - Repository renamed from StringyMcStringFace to Stringy
 - Improved YARA formatter code quality and test coverage
 - Clarified ASCII rule for Unicode handling in documentation
+- Renamed `extract_utf16le_strings` to `extract_utf16_strings` for consistency
```
🤖 Prompt for AI Agents
In `@CHANGELOG.md` around lines 27 - 30, Update the "Changed" section in
CHANGELOG.md to explicitly document the breaking API rename: note that
extract_utf16le_strings was renamed to extract_utf16_strings (include both names
and state it's a breaking change for callers), add a short upgrade note
instructing users to replace calls to extract_utf16le_strings with
extract_utf16_strings, and ensure this entry is listed alongside the other
"Changed" items so consumers see the migration step when upgrading.
```markdown
## Tests

- Add or update tests for behavior changes.
- Use insta snapshots for output verification when appropriate.
- Integration tests live in tests/ and fixtures in tests/fixtures/.
- Use insta snapshots for output verification when changing output formatters.

Run:

- `just test`
```
Duplicate guidance and missing language specifier.
Lines 61 and 63 contain duplicate text about insta snapshots:
- Line 61: "Use insta snapshots for output verification when appropriate."
- Line 63: "Use insta snapshots for output verification when changing output formatters."
Also, line 59 has an empty code block - consider adding a language specifier or removing it if unneeded.
Proposed fix

```diff
 ## Tests

 - Add or update tests for behavior changes.
-- Use insta snapshots for output verification when appropriate.
+- Use insta snapshots for output verification, especially when changing output formatters.
 - Integration tests live in tests/ and fixtures in tests/fixtures/.
-- Use insta snapshots for output verification when changing output formatters.
-
-Run:
-
-- `just test`
+- Run tests with `just test`
```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
59-59: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@CONTRIBUTING.md` around lines 58 - 68, In CONTRIBUTING.md under the "##
Tests" section remove the duplicate insta snapshot guidance by keeping one
concise sentence (e.g., "Use insta snapshots for output verification when
appropriate or when changing output formatters") and delete the redundant line;
also update the empty/unspecified code fence under the "Run:" block to include a
language specifier (e.g., ```bash) so the `just test` snippet is properly
highlighted and the file no longer contains an empty/unspecified code block.
| println!("Top strings by score:"); | ||
| println!("{:-<60}", ""); | ||
| for string in sorted_strings.iter().take(20) { | ||
| let tags: Vec<_> = string.tags.iter().map(|t| format!("{:?}", t)).collect(); | ||
| let tags_str = if tags.is_empty() { | ||
| String::new() | ||
| } else { | ||
| format!(" [{}]", tags.join(", ")) | ||
| }; | ||
| println!( | ||
| "{:4} | {:50}{}", | ||
| string.score, | ||
| if string.text.len() > 50 { | ||
| format!("{}...", &string.text[..47]) | ||
| } else { | ||
| string.text.clone() | ||
| }, | ||
| tags_str | ||
| ); |
Fix byte slicing to handle multi-byte UTF-8 characters

The condition `if string.text.len() > 50` checks byte length, but `&string.text[..47]` performs byte slicing. This will panic when a multi-byte UTF-8 character spans byte boundaries. Since the tool actively extracts UTF-8 strings (including multi-byte characters), use character-aware truncation instead.
Proposed change
```diff
-    println!(
-        "{:4} | {:50}{}",
-        string.score,
-        if string.text.len() > 50 {
-            format!("{}...", &string.text[..47])
-        } else {
-            string.text.clone()
-        },
-        tags_str
-    );
+    let display_text = if string.text.chars().count() > 50 {
+        let truncated: String = string.text.chars().take(47).collect();
+        format!("{}...", truncated)
+    } else {
+        string.text.clone()
+    };
+    println!("{:4} | {:50}{}", string.score, display_text, tags_str);
```
🤖 Prompt for AI Agents
In `@examples/basic_extraction.rs` around lines 53 - 71, The current truncation
uses byte slicing (&string.text[..47]) which can panic on multi-byte UTF-8;
change to count characters with string.text.chars().count() for the length check
and perform truncation with a character-aware approach like
string.text.chars().take(47).collect::<String>() when building the displayed
text (keep the else branch using string.text.clone()); update the condition and
the truncated branch in the loop over sorted_strings to use these char-aware
operations to avoid UTF-8 boundary panics.
```rust
use std::env;
use std::fs;
use stringy::container::{create_parser, detect_format};
use stringy::extraction::{BasicExtractor, ExtractionConfig, StringExtractor};
use stringy::types::Tag;
```
Add re-exports to lib.rs and update example imports
The example violates the guideline: "Imports must be from stringy::extraction or stringy::types, not deeply nested paths - use re-exports from lib.rs." The create_parser and detect_format functions are not currently re-exported in lib.rs.
Update lib.rs to re-export these items from the container module alongside extraction and types:
```rust
pub use container::{create_parser, detect_format};
```

Then update this example (and basic_extraction.rs, output_formats.rs) to import from the crate root:

```diff
-use stringy::container::{create_parser, detect_format};
-use stringy::extraction::{BasicExtractor, ExtractionConfig, StringExtractor};
-use stringy::types::Tag;
+use stringy::{create_parser, detect_format, BasicExtractor, ExtractionConfig, StringExtractor, Tag};
```

🤖 Prompt for AI Agents
In `@examples/network_indicators.rs` around lines 8 - 12, Add re-exports for the
container utilities in lib.rs and update examples to import from the crate root:
add a pub use container::{create_parser, detect_format}; to lib.rs so
create_parser and detect_format are exported alongside extraction and types,
then change the example files (examples/network_indicators.rs,
basic_extraction.rs, output_formats.rs) to import create_parser and
detect_format from stringy:: (crate root) instead of stringy::container to
comply with the import guideline.
```rust
use std::env;
use std::fs;
use stringy::container::{create_parser, detect_format};
use stringy::extraction::{BasicExtractor, ExtractionConfig, StringExtractor};
use stringy::output::{OutputFormat, OutputMetadata, format_output};
```
Align extraction and output imports with crate-root re-exports
The BasicExtractor, ExtractionConfig, StringExtractor, OutputFormat, OutputMetadata, and format_output are already re-exported from the crate root. Use those re-exports instead of deeply nested paths. However, create_parser and detect_format are not re-exported, so keep those imports from stringy::container.
Proposed change

```diff
-use stringy::container::{create_parser, detect_format};
-use stringy::extraction::{BasicExtractor, ExtractionConfig, StringExtractor};
-use stringy::output::{OutputFormat, OutputMetadata, format_output};
+use stringy::{
+    create_parser, detect_format, // Still from container module (not re-exported)
+    BasicExtractor, ExtractionConfig, StringExtractor,
+    OutputFormat, OutputMetadata, format_output,
+};
```

📝 Committable suggestion
```rust
use std::env;
use std::fs;
use stringy::{
    BasicExtractor, ExtractionConfig, StringExtractor,
    OutputFormat, OutputMetadata, format_output,
};
use stringy::container::{create_parser, detect_format};
```
🤖 Prompt for AI Agents
In `@examples/output_formats.rs` around lines 10 - 15, Update the top-level
imports in examples/output_formats.rs to use crate-root re-exports for
BasicExtractor, ExtractionConfig, StringExtractor, OutputFormat, OutputMetadata,
and format_output instead of importing them from deep modules; keep
create_parser and detect_format imported from stringy::container. Concretely,
replace stringy::extraction::{BasicExtractor, ExtractionConfig, StringExtractor}
and stringy::output::{OutputFormat, OutputMetadata, format_output} with a single
import of those symbols from the crate root while leaving use
stringy::container::{create_parser, detect_format} unchanged.
```rust
// Limit to top 50 strings for demonstration
let mut sorted_strings = strings;
sorted_strings.sort_by(|a, b| b.score.cmp(&a.score));
let top_strings: Vec<_> = sorted_strings.into_iter().take(50).collect();

// Create output metadata
let binary_name = std::path::Path::new(path)
    .file_name()
    .and_then(|n| n.to_str())
    .unwrap_or("unknown")
    .to_string();

let metadata = OutputMetadata::new(
    binary_name,
    output_format,
    top_strings.len(),
    top_strings.len(),
);
```
Capture full extracted count before truncation to accurately reflect metadata
The total_strings field should represent all extracted strings before display limits are applied. Currently both total_strings and filtered_strings are set to the truncated count, making it impossible to distinguish between total extractions and displayed results.
Capture sorted_strings.len() before the .take(50) truncation and pass that as total_strings to OutputMetadata::new().
Proposed change
```diff
 let mut sorted_strings = strings;
 sorted_strings.sort_by(|a, b| b.score.cmp(&a.score));
+let total_strings = sorted_strings.len();
 let top_strings: Vec<_> = sorted_strings.into_iter().take(50).collect();

 let metadata = OutputMetadata::new(
     binary_name,
     output_format,
-    top_strings.len(),
+    total_strings,
     top_strings.len(),
 );
```
🤖 Prompt for AI Agents
In `@examples/output_formats.rs` around lines 48 - 65, Capture the full count of
extracted strings before truncation and pass it as total_strings to
OutputMetadata::new: compute a variable (e.g., full_count =
sorted_strings.len()) immediately after sorting (before calling
.into_iter().take(50)), then use full_count for the total_strings argument and
top_strings.len() for filtered_strings when constructing OutputMetadata
(referencing sorted_strings, top_strings, OutputMetadata::new, binary_name,
output_format).
#### 12. Add memory mapping for large files

**Priority:** High
**Current state:** Entire file is loaded into memory
**Impact:** Processing 1GB+ binaries requires 1GB+ RAM
**Recommendation:** Use `memmap2` crate for memory-mapped file access

```rust
// Recommended approach
use memmap2::Mmap;
use std::fs::File;

let file = File::open(path)?;
let mmap = unsafe { Mmap::map(&file)? };
let data: &[u8] = &mmap;
```
Correct line numbers to 143–165 and add safety documentation for the unsafe memmap2 usage.
The code example uses unsafe { Mmap::map(...) } without documenting the safety invariants. Per memmap2's documentation, the mapping is unsafe because the underlying file can be modified or truncated during use, causing undefined behavior. Add a note explaining the key safety requirements:
Proposed safety documentation

```diff
 let file = File::open(path)?;
 let mmap = unsafe { Mmap::map(&file)? };
 let data: &[u8] = &mmap;
+
+Safety: `Mmap::map` is unsafe because the file can be modified or truncated while mapped,
+causing undefined behavior. Ensure the file remains unchanged during the mapping's lifetime
+(via file permissions, locks, or process-private files), avoid mapping files subject to
+concurrent modification, and validate file length if you rely on stable boundaries.
```
+concurrent modification, and validate file length if you rely on stable boundaries.🤖 Prompt for AI Agents
In `@ROADMAP.md` around lines 93 - 107, Update the roadmap section header line
numbers from 93–107 to 143–165 and add a short safety note next to the memmap2
example documenting the unsafe invariants: explain why Mmap::map is wrapped in
unsafe, state that the mapped file must not be truncated or modified for the
lifetime of the Mmap, that any mutable access to the underlying file must be
avoided while the mapping exists, and recommend validating file size/permissions
before mapping; reference the memmap2::Mmap type and the unsafe block (unsafe {
Mmap::map(&file)? }) so readers know exactly where the safety requirements
apply.
Split the oversized table.rs (708 lines) into a module directory with four focused files, all under the 500-line project limit:

- table/mod.rs (120 lines): Public API, constants, entry points
- table/tty.rs (220 lines): TTY table rendering with column alignment
- table/plain.rs (96 lines): Plain text output for piping
- table/formatting.rs (326 lines): Shared utilities (truncate, pad, format_tags)

This addresses the ticket acceptance criterion requiring all files to be under 500 lines.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@src/output/table/formatting.rs`:
- Around line 6-8: The import of RankingConfig uses a deep module path; update
the import to use the crate re-export (e.g., replace the deep path with the
re-exported symbol like crate::RankingConfig) or add a re-export for
RankingConfig in the crate root (lib.rs) if it’s missing; locate the import
statement in formatting.rs that references RankingConfig and either change it to
use crate::RankingConfig or add pub use
crate::classification::ranking::RankingConfig; in either case ensure other code
compiles against the new re-exported path.
- Around line 121-153: The truncate_string function can choose a character
boundary that yields an output longer than max_len when the first char is
multibyte; update the logic that computes end_index (based on truncate_at and
s.char_indices()) so that you only include an extra character if its byte index
is <= truncate_at (i.e., find the last char boundary <= truncate_at and use
that), and remove the fallback that unconditionally advances to include the
second char; if no valid boundary exists leave end_index at 0 so the result
becomes "..." (which will not exceed max_len). Reference: truncate_string,
truncate_at, end_index, and the char_indices() iteration.
In `@src/output/table/mod.rs`:
- Around line 29-31: Update the module doc comment in src/output/table/mod.rs so
the "Tags" bullet reflects actual behavior: state that tags are the top
(highest-boost) 2–3 tags (not the first tags), comma-separated, and each tag is
truncated to 20 characters; keep the rest of the bullets (String and Score)
unchanged and reference the same doc block around the "Tags" line that currently
describes tag selection.
In `@src/output/table/tty.rs`:
- Around line 20-75: format_table_tty currently uses FoundString.text directly
when building rows which allows embedded CR/LF and tabs to break the TTY table;
before calling truncate_string and pad_string inside the loop, sanitize
found_string.text by normalizing or removing control whitespace (replace '\r'
and '\n' with a single space and convert tabs to a space, optionally collapse
consecutive spaces) and use that sanitized string for truncate_string,
pad_string, and for calculating tags/section widths if needed (references:
format_table_tty, FoundString.text, truncate_string, pad_string,
calculate_section_width, calculate_tags_width, format_tags).
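The two table fixes requested above (boundary-safe truncation and control-character sanitization) are easiest to see together. The sketch below follows the behavior described in the prompts; `sanitize_for_display` is a hypothetical helper name, not necessarily the project's actual implementation:

```rust
/// Truncate on a char boundary so the result never exceeds `max_len` bytes,
/// appending "..." when truncation happens (sketch of the requested fix;
/// assumes max_len >= 3).
fn truncate_string(s: &str, max_len: usize) -> String {
    if s.len() <= max_len {
        return s.to_string();
    }
    let budget = max_len.saturating_sub(3); // reserve room for "..."
    // Find the last char boundary at or below the byte budget; if even the
    // first char is too wide, `end` stays 0 and the result is just "...".
    let mut end = 0;
    for (idx, ch) in s.char_indices() {
        if idx + ch.len_utf8() > budget {
            break;
        }
        end = idx + ch.len_utf8();
    }
    format!("{}...", &s[..end])
}

/// Replace CR/LF/tab and other control characters with single spaces so
/// embedded control whitespace cannot break TTY table alignment.
fn sanitize_for_display(s: &str) -> String {
    s.chars()
        .map(|c| if c.is_control() { ' ' } else { c })
        .collect()
}
```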
src/output/table/formatting.rs
Outdated
use crate::classification::ranking::RankingConfig;
use crate::types::Tag;
🛠️ Refactor suggestion | 🟠 Major
Align RankingConfig import with crate re-export policy.
Line 6 imports RankingConfig via a deep module path. Please use the crate re-export (or add it if missing) to comply with the import policy. As per coding guidelines, align import paths via crate re-exports.
🤖 Prompt for AI Agents
In `@src/output/table/formatting.rs` around lines 6 - 8, The import of
RankingConfig uses a deep module path; update the import to use the crate
re-export (e.g., replace the deep path with the re-exported symbol like
crate::RankingConfig) or add a re-export for RankingConfig in the crate root
(lib.rs) if it’s missing; locate the import statement in formatting.rs that
references RankingConfig and either change it to use crate::RankingConfig or add
pub use crate::classification::ranking::RankingConfig; in either case ensure
other code compiles against the new re-exported path.
//! - **String**: Up to 60 characters, truncated with `...` if longer
//! - **Tags**: First 2-3 tags, comma-separated, max 20 characters
//! - **Score**: Right-aligned integer score
Docs mismatch tag selection behavior.
Line 30 states tags show the first 2-3 tags, but the implementation selects highest boost tags with truncation. Update the doc to match actual behavior.
Proposed fix
-//! - **Tags**: First 2-3 tags, comma-separated, max 20 characters
+//! - **Tags**: Highest priority tags (ties), comma-separated, max 20 characters
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
//! - **String**: Up to 60 characters, truncated with `...` if longer
//! - **Tags**: Highest priority tags (ties), comma-separated, max 20 characters
//! - **Score**: Right-aligned integer score
🤖 Prompt for AI Agents
In `@src/output/table/mod.rs` around lines 29 - 31, Update the module doc comment
in src/output/table/mod.rs so the "Tags" bullet reflects actual behavior: state
that tags are the top (highest-boost) 2–3 tags (not the first tags),
comma-separated, and each tag is truncated to 20 characters; keep the rest of
the bullets (String and Score) unchanged and reference the same doc block around
the "Tags" line that currently describes tag selection.
Fix two issues in YARA output formatter:

1. UTF-16LE non-ASCII handling: The `wide` modifier only works for ASCII characters. Non-ASCII UTF-16LE strings now use hex encoding instead of the incorrect `wide` modifier approach.
   - Add `utf16le_hex_string` helper (mirrors `utf16be_hex_string`)
   - Branch on `is_ascii()` to choose appropriate encoding
2. Injection prevention: Escape `binary_name` and `timestamp` in comment headers and meta fields to prevent newline/quote injection attacks.

Add 8 new tests covering:

- UTF-16LE hex string encoding (basic, empty, non-ASCII, surrogate pairs)
- ASCII UTF-16LE still uses wide modifier
- Non-ASCII UTF-16LE uses hex encoding
- Binary name injection escaping
- Timestamp injection escaping

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@ROADMAP.md`:
- Around line 11-12: The bold-only metadata lines (e.g., "**Priority:** High
**Current state:** 1542 lines (exceeds 500-line project limit by 1042 lines)
**Files affected:** `src/extraction/mod.rs`") are being treated as emphasis
headings by markdownlint MD036; replace these bold-only lines with either a
Markdown list or a Markdown table (for example, convert to "- Priority: High",
"- Current state: 1542 lines..." and "- Files affected: src/extraction/mod.rs"
or a two-column table) and apply the same pattern consistently to every other
metadata block in ROADMAP.md so no line consists solely of bold/emphasis markup.
In `@src/output/yara.rs`:
- Around line 266-536: The module-level #[cfg(test)] mod tests block (containing
test functions like test_sanitize_rule_name, test_escape_yara_string,
test_utf16be_hex_string_*, etc.) makes the source file exceed the 500-line
guideline; extract that entire tests block into a separate test module (e.g.,
yara_tests) so the main module only contains implementation code. Create a new
test file/module containing the same use imports and test functions, keep the
helper fns (make_metadata, make_string) inside that test module, and remove the
#[cfg(test)] mod tests block from this file; ensure the moved tests still
reference the same public functions (sanitize_rule_name, escape_yara_string,
escape_yara_unicode_literal, utf16be_hex_string, utf16le_hex_string,
format_yara, group_strings_by_tag) so they compile unchanged.
- Around line 251-264: Add a short doc comment above the group_strings_by_tag
function explaining that when a FoundString has multiple tags only the first tag
(via tag_name on the first entry) is used for grouping and that entries with no
tags are grouped under "untagged"; reference the function name
group_strings_by_tag, the FoundString type, and the tag_name helper in the
comment so future maintainers understand this deliberate simplification.
**Priority:** High **Current state:** 1542 lines (exceeds 500-line project limit by 1042 lines) **Files affected:** `src/extraction/mod.rs`
Avoid bold-only lines to satisfy MD036.
Markdownlint flags these lines as emphasis used as a heading. Prefer a list (or table) for the metadata and apply the same pattern to the other items.
Proposed fix (apply consistently)
-**Priority:** High **Current state:** 1542 lines (exceeds 500-line project limit by 1042 lines) **Files affected:** `src/extraction/mod.rs`
+- Priority: High
+- Current state: 1542 lines (exceeds 500-line project limit by 1042 lines)
+- Files affected: `src/extraction/mod.rs`
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- Priority: High
- Current state: 1542 lines (exceeds 500-line project limit by 1042 lines)
- Files affected: `src/extraction/mod.rs`
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
11-11: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
In `@ROADMAP.md` around lines 11 - 12, The bold-only metadata lines (e.g.,
"**Priority:** High **Current state:** 1542 lines (exceeds 500-line project
limit by 1042 lines) **Files affected:** `src/extraction/mod.rs`") are being
treated as emphasis headings by markdownlint MD036; replace these bold-only
lines with either a Markdown list or a Markdown table (for example, convert to
"- Priority: High", "- Current state: 1542 lines..." and "- Files affected:
src/extraction/mod.rs" or a two-column table) and apply the same pattern
consistently to every other metadata block in ROADMAP.md so no line consists
solely of bold/emphasis markup.
fn group_strings_by_tag(strings: &[FoundString]) -> BTreeMap<String, Vec<&FoundString>> {
    let mut grouped: BTreeMap<String, Vec<&FoundString>> = BTreeMap::new();

    for item in strings {
        let tag = item
            .tags
            .first()
            .map(|tag| tag_name(tag).to_string())
            .unwrap_or_else(|| "untagged".to_string());
        grouped.entry(tag).or_default().push(item);
    }

    grouped
}
🧹 Nitpick | 🔵 Trivial
Consider documenting single-tag grouping behavior.
The function uses only the first tag when a string has multiple tags. This is a reasonable simplification for YARA output, but consider adding a brief doc comment clarifying this behavior so future maintainers understand the design choice.
Suggested documentation
+/// Groups strings by their primary tag (first tag only).
+///
+/// Strings with no tags are grouped under "untagged".
+/// If a string has multiple tags, only the first tag is used for grouping.
fn group_strings_by_tag(strings: &[FoundString]) -> BTreeMap<String, Vec<&FoundString>> {
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
/// Groups strings by their primary tag (first tag only).
///
/// Strings with no tags are grouped under "untagged".
/// If a string has multiple tags, only the first tag is used for grouping.
fn group_strings_by_tag(strings: &[FoundString]) -> BTreeMap<String, Vec<&FoundString>> {
    let mut grouped: BTreeMap<String, Vec<&FoundString>> = BTreeMap::new();
    for item in strings {
        let tag = item
            .tags
            .first()
            .map(|tag| tag_name(tag).to_string())
            .unwrap_or_else(|| "untagged".to_string());
        grouped.entry(tag).or_default().push(item);
    }
    grouped
}
🤖 Prompt for AI Agents
In `@src/output/yara.rs` around lines 251 - 264, Add a short doc comment above the
group_strings_by_tag function explaining that when a FoundString has multiple
tags only the first tag (via tag_name on the first entry) is used for grouping
and that entries with no tags are grouped under "untagged"; reference the
function name group_strings_by_tag, the FoundString type, and the tag_name
helper in the comment so future maintainers understand this deliberate
simplification.
    let metadata = OutputMetadata::new("test.bin".to_string(), OutputFormat::Yara, 0, 0);
    // Note: generated_at is None
    let output = format_yara(&[], &metadata).expect("Formatting should succeed");

    // Should contain a timestamp in the generated_at field
    assert!(output.contains("generated_at = \""));
    // Timestamp should be numeric (or CLOCK_ERROR in exceptional cases)
    assert!(
        output.contains("generated_at = \"1")
            || output.contains("generated_at = \"CLOCK_ERROR"),
        "Timestamp should be numeric or CLOCK_ERROR"
    );
}

#[test]
fn test_utf16le_hex_string_basic() {
    // Basic ASCII - should be little-endian (ASCII byte followed by 00)
    assert_eq!(utf16le_hex_string("A"), "{ 41 00 }");
    assert_eq!(utf16le_hex_string("AB"), "{ 41 00 42 00 }");
}

#[test]
fn test_utf16le_hex_string_empty() {
    assert_eq!(utf16le_hex_string(""), "{ }");
}

#[test]
fn test_utf16le_hex_string_non_ascii() {
    // Non-ASCII Unicode (BMP) - Chinese character U+4E2D
    let chinese = utf16le_hex_string("\u{4E2D}");
    assert_eq!(chinese, "{ 2d 4e }");
}

#[test]
fn test_utf16le_hex_string_surrogate_pair() {
    // Character requiring surrogate pair (outside BMP) - emoji U+1F600
    let emoji = utf16le_hex_string("\u{1F600}");
    // Should produce surrogate pair: 3D D8 00 DE (little-endian)
    assert_eq!(emoji, "{ 3d d8 00 de }");
}

#[test]
fn test_utf16le_ascii_uses_wide_modifier() {
    // ASCII UTF-16LE should use "wide" modifier
    let mut string = make_string("test");
    string.encoding = Encoding::Utf16Le;
    let output = format_yara(&[string], &make_metadata()).expect("Formatting should succeed");
    assert!(
        output.contains("wide"),
        "ASCII UTF-16LE should use wide modifier"
    );
    assert!(output.contains("\"test\""));
}

#[test]
fn test_utf16le_non_ascii_uses_hex() {
    // Non-ASCII UTF-16LE should use hex string, not wide modifier
    let mut string = make_string("\u{4E2D}\u{6587}");
    string.encoding = Encoding::Utf16Le;
    let output = format_yara(&[string], &make_metadata()).expect("Formatting should succeed");
    assert!(
        !output.contains("wide"),
        "Non-ASCII UTF-16LE should not use wide modifier"
    );
    assert!(
        output.contains("{ 2d 4e 87 65 }"),
        "Non-ASCII UTF-16LE should use hex encoding"
    );
}

#[test]
fn test_binary_name_injection_escaped_in_comments() {
    // Binary name with newlines should be escaped in comments
    let mut metadata = make_metadata();
    metadata.binary_name = "evil\nname".to_string();
    let output = format_yara(&[], &metadata).expect("Formatting should succeed");
    // Should contain escaped newline, not literal
    assert!(
        output.contains("evil\\nname"),
        "Newlines in binary_name should be escaped"
    );
    assert!(
        !output.contains("evil\nname"),
        "Literal newlines should not appear"
    );
}

#[test]
fn test_timestamp_injection_escaped_in_meta() {
    // Timestamp with special characters should be escaped
    let mut metadata = make_metadata();
    metadata.generated_at = Some("2024\"\n//attack".to_string());
    let output = format_yara(&[], &metadata).expect("Formatting should succeed");
    // Should contain escaped characters
    assert!(
        output.contains("2024\\\"\\n//attack"),
        "Special chars in timestamp should be escaped"
    );
}
}
🧹 Nitpick | 🔵 Trivial
Consider extracting tests to reduce file length.
The file is 536 lines, exceeding the 500-line guideline. The tests (~270 lines) could be moved to a separate tests/ integration test file or yara_tests.rs submodule. The main module would then be ~266 lines, well within limits.
This is a minor concern since the tests provide good coverage and the overage is small.
🤖 Prompt for AI Agents
In `@src/output/yara.rs` around lines 266 - 536, The module-level #[cfg(test)] mod
tests block (containing test functions like test_sanitize_rule_name,
test_escape_yara_string, test_utf16be_hex_string_*, etc.) makes the source file
exceed the 500-line guideline; extract that entire tests block into a separate
test module (e.g., yara_tests) so the main module only contains implementation
code. Create a new test file/module containing the same use imports and test
functions, keep the helper fns (make_metadata, make_string) inside that test
module, and remove the #[cfg(test)] mod tests block from this file; ensure the
moved tests still reference the same public functions (sanitize_rule_name,
escape_yara_string, escape_yara_unicode_literal, utf16be_hex_string,
utf16le_hex_string, format_yara, group_strings_by_tag) so they compile
unchanged.
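For orientation, the encoding behavior pinned by the tests above can be implemented along these lines; this is a sketch consistent with the expected outputs, not necessarily the code in `yara/escaping.rs`:

```rust
/// Encode a string as a YARA hex-string of its UTF-16LE code units,
/// e.g. "A" -> "{ 41 00 }" and U+1F600 -> "{ 3d d8 00 de }".
fn utf16le_hex_string(s: &str) -> String {
    let mut bytes = Vec::new();
    for unit in s.encode_utf16() {
        // Little-endian: low byte first, then high byte; surrogate pairs
        // fall out naturally from encode_utf16().
        bytes.push(format!("{:02x}", unit & 0xff));
        bytes.push(format!("{:02x}", unit >> 8));
    }
    if bytes.is_empty() {
        "{ }".to_string()
    } else {
        format!("{{ {} }}", bytes.join(" "))
    }
}
```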
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@TESTING_ANALYSIS.md`:
- Around line 17-18: Replace the bolded pseudo-headings (e.g., "**Coverage by
Module**" and the other bolded titles found elsewhere) with proper Markdown ATX
headings (use ### or #### as appropriate for the document hierarchy) so they
appear in the document outline; search for the bold patterns like "**...**" used
as section titles and convert them to heading lines (for example change
"**Coverage by Module**" to "### Coverage by Module") keeping the text identical
except for removing the bold markers and adding the leading hashes.
- Around line 7-12: Update the hard-coded numeric metrics block (the lines
containing "**Total Tests**", "**Test Pass Rate**", "**Test Coverage**", "**Test
Modules**", and "**Fixtures**" in TESTING_ANALYSIS.md) to include a "Last
verified" date and the CI job or command used to produce these numbers; also
apply the same change to the repeated metrics around lines 472-477.
Specifically, append a short line like "Last verified: YYYY-MM-DD — generated
by: <CI job name or command>" immediately under the metrics so future readers
know when and how the counts were produced.
- Around line 67-79: The fenced code blocks in TESTING_ANALYSIS.md lack language
identifiers causing markdown lint and readability issues; update each
triple-backtick fence to include the correct language (e.g., ```text for
file/path listings, ```rust for Rust snippets) for all examples referencing
extraction::StringExtractor, extraction::BasicExtractor, the main function
examples (fn main / fn main() -> Result<(), Box<dyn std::error::Error>>), and
the test snippets (tests/test_deduplication_performance.rs,
tests/test_extraction_edge_cases.rs); ensure every opened ``` is replaced with
the appropriate ```rust or ```text and that the Rust examples reflect the
corrected function signature and Ok(()) where suggested.
- **Total Tests**: 535 tests (280 unit + 219 integration + 36 ignored/doctest)
- **Test Pass Rate**: 98.9% (529 passed, 6 failed/ignored)
- **Test Coverage**: 6,106 test lines vs 14,138 source lines (43% ratio)
- **Test Modules**: 24 modules with unit tests
- **Fixtures**: 5 binary fixtures (ELF, Mach-O, PE with/without resources)
Add a timestamp or source for the numeric metrics.
Hard-coded counts, pass rates, and timings will go stale. Add a "Last verified" date and the CI job or command that produced the numbers.
Proposed fix
-### Overall Test Health: STRONG with Minor Gaps
+### Overall Test Health: STRONG with Minor Gaps
+Last verified: 2026-01-18 (source: CI job "tests", command: `cargo test --all`)
- **Total Tests**: 535 tests (280 unit + 219 integration + 36 ignored/doctest)
- **Test Pass Rate**: 98.9% (529 passed, 6 failed/ignored)
- **Test Coverage**: 6,106 test lines vs 14,138 source lines (43% ratio)
- **Test Modules**: 24 modules with unit tests
- **Fixtures**: 5 binary fixtures (ELF, Mach-O, PE with/without resources)
-**Test Suite Speed**: FAST
+**Test Suite Speed**: FAST
+Last verified: 2026-01-18 (source: CI job "tests", command: `cargo test --all`)
Also applies to: 472-477
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
11-11: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
In `@TESTING_ANALYSIS.md` around lines 7 - 12, Update the hard-coded numeric
metrics block (the lines containing "**Total Tests**", "**Test Pass Rate**",
"**Test Coverage**", "**Test Modules**", and "**Fixtures**" in
TESTING_ANALYSIS.md) to include a "Last verified" date and the CI job or command
used to produce these numbers; also apply the same change to the repeated
metrics around lines 472-477. Specifically, append a short line like "Last
verified: YYYY-MM-DD — generated by: <CI job name or command>" immediately under
the metrics so future readers know when and how the counts were produced.
**Coverage by Module**:
Replace bolded pseudo-headings with proper Markdown headings.
These lines are styled as headings using emphasis, which breaks the document outline. Use ### or #### headings instead.
Proposed fix
-**Coverage by Module**:
+#### Coverage by Module
-**Test Files**:
+#### Test Files
-**Snapshot Testing**: Using `insta` for output validation
+#### Snapshot Testing
+Using `insta` for output validation
-**Test Fixtures**: Well-organized in `tests/fixtures/`
+#### Test Fixtures
+Well-organized in `tests/fixtures/`
-**Good Examples**:
+#### Good Examples
-**Well-Tested Edge Cases**:
+#### Well-Tested Edge Cases
-**Positive Findings**:
+#### Positive Findings
-**Snapshot Testing**:
+#### Snapshot Testing
-**Strengths**:
+#### Strengths
-**Weaknesses**:
+#### Weaknesses
-**Strengths**:
+#### Strengths
-**Critical Weaknesses**:
+#### Critical Weaknesses
Also applies to: 27-28, 45-45, 179-180, 195-196, 214-215, 225-226, 289-290, 311-312, 422-423, 430-431
🤖 Prompt for AI Agents
In `@TESTING_ANALYSIS.md` around lines 17 - 18, Replace the bolded pseudo-headings
(e.g., "**Coverage by Module**" and the other bolded titles found elsewhere)
with proper Markdown ATX headings (use ### or #### as appropriate for the
document hierarchy) so they appear in the document outline; search for the bold
patterns like "**...**" used as section titles and convert them to heading lines
(for example change "**Coverage by Module**" to "### Coverage by Module")
keeping the text identical except for removing the bold markers and adding the
leading hashes.
```text
src\extraction\mod.rs - extraction::StringExtractor (line 318)
src\extraction\mod.rs - extraction::BasicExtractor (line 408)
```

**Problem**: Doctests use `?` operator without proper return type:

```rust
fn main() { // Should be: fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = std::fs::read("binary_file")?; // Error: can't use ? in fn returning ()
    ...
}
```
Add language identifiers to fenced code blocks.
These code fences are missing language tags, which violates the markdown rules and hurts readability. Add explicit languages like text or rust as appropriate.
Proposed fix
-```
+```text
src\extraction\mod.rs - extraction::StringExtractor (line 318)
src\extraction\mod.rs - extraction::BasicExtractor (line 408)
```

-```
+```rust
fn main() { // Should be: fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = std::fs::read("binary_file")?; // Error: can't use ? in fn returning ()
    ...
}
```

-```
+```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let _cli = Cli::parse();
    // TODO: Implement main extraction pipeline
    println!("Stringy - Binary string extraction tool");
    println!("Implementation coming soon...");
    Ok(())
}
```

-```
+```rust
if section_offset >= data.len() {
    return Ok(Vec::new());
}
let end_offset = section_offset
    .checked_add(section_size)
    .unwrap_or(data.len())
    .min(data.len());
```

-```
+```rust
#[test]
fn test_section_beyond_file_boundary() {
    // Section offset > data.len()
}
#[test]
fn test_section_size_overflow() {
    // section_offset + section_size overflows
}
```

-```
+```rust
// In src/extraction/mod.rs (line 318 and 408)
// Change: fn main() {
// To: fn main() -> Result<(), Box<dyn std::error::Error>> {
// Add: Ok(()) at end of function
```

-```
+```rust
// tests/test_deduplication_performance.rs
#[test]
#[ignore] // Marked as ignored for normal runs
fn test_deduplication_large_input() {
    // Test with 10,000 duplicate strings
}
```

-```
+```rust
// tests/test_extraction_edge_cases.rs
#[test]
fn test_section_beyond_boundary() {
    // Section offset > data.len()
}
```
Also applies to: 116-125, 136-145, 161-171, 339-344, 348-355, 359-365
🤖 Prompt for AI Agents
In `@TESTING_ANALYSIS.md` around lines 67 - 79, The fenced code blocks in
TESTING_ANALYSIS.md lack language identifiers causing markdown lint and
readability issues; update each triple-backtick fence to include the correct
language (e.g., ```text for file/path listings, ```rust for Rust snippets) for
all examples referencing extraction::StringExtractor,
extraction::BasicExtractor, the main function examples (fn main / fn main() ->
Result<(), Box<dyn std::error::Error>>), and the test snippets
(tests/test_deduplication_performance.rs, tests/test_extraction_edge_cases.rs);
ensure every opened ``` is replaced with the appropriate ```rust or ```text and
that the Rust examples reflect the corrected function signature and Ok(()) where
suggested.
- Added 'moby' configuration to Docker feature
- Removed unused features from the devcontainer configuration
- Added mise.toml for tool dependencies

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In @.devcontainer/devcontainer.json:
- Line 48: Add a trailing newline to the end of the file so it ends with a
newline character (POSIX convention). Open the .devcontainer/devcontainer.json
file and insert a single newline at EOF so the file terminates with '\n'; no
code changes beyond adding that final newline are required.
- Around line 40-41: The extensions list includes both "1YiB.rust-bundle" and
"rust-lang.rust-analyzer", which overlap; decide which to keep and remove the
duplicate entry accordingly: if you want the bundled set keep "1YiB.rust-bundle"
and delete "rust-lang.rust-analyzer", otherwise keep the official
"rust-lang.rust-analyzer" and remove "1YiB.rust-bundle"; update the "extensions"
array in devcontainer.json to contain only the chosen extension and, if
necessary, update README or comments to document the rationale.
- Around line 5-23: The devcontainer features currently set "version": "latest"
(e.g. entries for "ghcr.io/devcontainers/features/docker-outside-of-docker:1",
"ghcr.io/devcontainers/features/github-cli:1",
"ghcr.io/eitsupi/devcontainer-features/mdbook:1",
"ghcr.io/devcontainers-extra/features/claude-code:1", and
"ghcr.io/devcontainers-extra/features/mise:1") should be pinned to concrete
version strings for reproducibility; update each feature's "version" value to a
specific release/tag (or add a comment listing the expected version) and, where
available, replace "latest" with the exact version you validated locally to
ensure deterministic dev environments.
In `@justfile`:
- Around line 55-66: Replace the duplicated mise install steps in both the setup
and [unix] setup recipes by invoking the existing mise-install recipe: remove
the repeated mise trust/mise install lines and call the mise-install recipe from
setup and the [unix] setup recipe (keep any environment-specific lines like
rustup component additions or the mdformat-install/Write-Host lines as needed),
so both variants reuse the single mise-install implementation and avoid drift.
In `@mise.toml`:
- Around line 1-15: The tools table in mise.toml pins versions except for claude
which is set to "latest", causing non-deterministic installs; update the claude
entry to a specific version string (match the project's versioning scheme used
for other tools) by replacing claude = "latest" with a pinned version (e.g.,
claude = "<x.y.z>") so installs are reproducible; ensure the chosen version is
compatible with CI and document the selected version in the repo (reference: the
claude entry in the tools block).
| "ghcr.io/devcontainers/features/docker-outside-of-docker:1": { | ||
| "installDockerBuildx": true, | ||
| "version": "latest", | ||
| "dockerDashComposeVersion": "v2", | ||
| "moby": false | ||
| }, | ||
| "ghcr.io/devcontainers/features/github-cli:1": { | ||
| "installDirectlyFromGitHubRelease": true, | ||
| "version": "latest" | ||
| }, | ||
| "ghcr.io/eitsupi/devcontainer-features/mdbook:1": { | ||
| "version": "latest" | ||
| }, | ||
| "ghcr.io/devcontainers-extra/features/claude-code:1": { | ||
| "version": "latest" | ||
| }, | ||
| "ghcr.io/devcontainers-extra/features/mise:1": { | ||
| "version": "latest" | ||
| } |
🧹 Nitpick | 🔵 Trivial
Consider pinning feature versions for reproducible dev environments.
All features use version: "latest", which can cause inconsistent environments across team members or over time. For a binary analysis tool, reproducible builds and test environments reduce debugging headaches.
Pin to specific versions or at minimum document the expected versions in a comment.
Example version pinning
"ghcr.io/devcontainers/features/docker-outside-of-docker:1": {
"installDockerBuildx": true,
- "version": "latest",
+ "version": "27.4",
"dockerDashComposeVersion": "v2",
"moby": false
},
"ghcr.io/devcontainers/features/github-cli:1": {
"installDirectlyFromGitHubRelease": true,
- "version": "latest"
+ "version": "2.64"
},
🤖 Prompt for AI Agents
In @.devcontainer/devcontainer.json around lines 5 - 23, The devcontainer
features currently set "version": "latest" (e.g. entries for
"ghcr.io/devcontainers/features/docker-outside-of-docker:1",
"ghcr.io/devcontainers/features/github-cli:1",
"ghcr.io/eitsupi/devcontainer-features/mdbook:1",
"ghcr.io/devcontainers-extra/features/claude-code:1", and
"ghcr.io/devcontainers-extra/features/mise:1") should be pinned to concrete
version strings for reproducibility; update each feature's "version" value to a
specific release/tag (or add a comment listing the expected version) and, where
available, replace "latest" with the exact version you validated locally to
ensure deterministic dev environments.
      ]
    }
  }
}
No newline at end of file
🧹 Nitpick | 🔵 Trivial
Add trailing newline.
File ends without a newline. POSIX convention and most linters expect files to end with a newline character.
🤖 Prompt for AI Agents
In @.devcontainer/devcontainer.json at line 48, Add a trailing newline to the
end of the file so it ends with a newline character (POSIX convention). Open the
.devcontainer/devcontainer.json file and insert a single newline at EOF so the
file terminates with '\n'; no code changes beyond adding that final newline are
required.
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Extract YARA string escaping utilities to separate escaping.rs module:

- escape_yara_string, escape_yara_unicode_literal
- utf16be_hex_string, utf16le_hex_string
- All associated tests

The original yara.rs was 536 lines, now split into:

- yara/mod.rs (357 lines) - main formatting logic
- yara/escaping.rs (204 lines) - escaping utilities

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Fix formatting bug where rules 3 and 4 were merged on one line, correct typos, add Rust version requirements, expand development commands, clarify module structure, and document key dependencies.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @.vscode/settings.json:
- Around line 1-12: The .vscode/settings.json contains unused Ruff and local
tool paths ("ruff.path", "ruff.interpreter", "python.defaultInterpreterPath",
"debug.javascript.defaultRuntimeExecutable") that are unclear or dead; either
remove the Ruff entries if unused or wire Ruff into the repository toolchain by
adding it to .pre-commit-config.yaml and CI, and if Python/Node under
.vscode/mise-tools are required for docs/builds/tests, document the purpose and
the steps to populate .vscode/mise-tools in README or CONTRIBUTING (refer to
these specific keys when editing) so developers know when to install those local
interpreters and runtimes.
In `@AGENTS.md`:
- Line 17: The line "**Rust**: Edition 2024, MSRV 1.91" is using emphasis as a
heading; replace it with a real Markdown heading or list item (e.g., change to
"### Rust — Edition 2024, MSRV 1.91" or use "- Rust: Edition 2024, MSRV 1.91")
so markdownlint no longer flags it; locate the exact text "**Rust**: Edition
2024, MSRV 1.91" and update it accordingly.
- Fix test metadata count mismatches in output_table_integration.rs
- Remove underscore prefix from used parameters in json.rs
- Use crate re-export for RankingConfig import
- Fix truncate_string edge case for wide multibyte characters
- Update table docs to reflect actual tag selection behavior
- Add control character sanitization for TTY output
- Use UTF-8 encoding for Unicode test content
- Add Claude Code AI assistance section to CONTRIBUTING.md
- Update snapshots for encoding and sanitization changes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove duplicate rust-bundle extension, keep official rust-analyzer
- Refactor setup recipes to reuse mise-install recipe
- Fix markdown lint: convert emphasis-as-heading to list items

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add mise_exec variable and prefix all tool commands (cargo, prettier, pre-commit, actionlint, cspell, markdownlint, lychee, dist, mdbook, goreleaser) with mise exec to ensure correct tool versions are used regardless of shell activation state.

Also adds dotenv-load and ignore-comments settings for consistency with other projects.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/output_table_integration.rs`:
- Around line 302-308: The test test_plain_unicode_strings constructs
Unicode-containing cells using make_string but currently uses Encoding::Ascii
(via make_string), which is incorrect; update the calls to make_string to use an
appropriate Unicode encoding (e.g., Encoding::Utf8) and ensure the associated
metadata from make_metadata reflects UTF-8 encoding so encoding handling is
exercised correctly when calling format_table_with_mode; locate the make_string
calls inside test_plain_unicode_strings and change the encoding argument and/or
the metadata returned by make_metadata to UTF-8/Unicode.
♻️ Duplicate comments (5)
CONTRIBUTING.md (1)

60-63: Consolidate duplicate snapshot guidance.

Lines 61 and 63 repeat the same instruction. Keep one concise bullet to avoid drift.

Proposed fix

- - Use insta snapshots for output verification when appropriate.
- - Integration tests live in tests/ and fixtures in tests/fixtures/.
- - Use insta snapshots for output verification when changing output formatters.
+ - Use insta snapshots for output verification when appropriate, especially when changing output formatters.
+ - Integration tests live in tests/ and fixtures in tests/fixtures/.

TESTING_ANALYSIS.md (2)

7-12: Add last-verified metadata for numeric test metrics.

The counts and pass rates will go stale. Add a date and source near this block and repeat for the later metrics block around Lines 472-477.

Proposed fix

 - **Fixtures**: 5 binary fixtures (ELF, Mach-O, PE with/without resources)
+Last verified: 2026-01-18 - source: CI job "tests", command: `cargo test --all`

17-28: Replace bold pseudo-headings with proper Markdown headings.

Bold text is being used as headings, which breaks the document outline. Convert these to ATX headings and apply the change consistently.

Proposed fix

-**Coverage by Module**:
+#### Coverage by Module
@@
-**Test Files**:
+#### Test Files

src/output/table/formatting.rs (1)

6-7: Use crate re-export for RankingConfig import.

The deep path import violates the re-export policy. Switch to the crate re-export, or add one in lib.rs if missing.

Proposed fix

-use crate::classification::RankingConfig;
+use crate::RankingConfig;

As per coding guidelines, use crate re-exports for imports.

src/output/table/tty.rs (1)

78-94: Section column needs sanitization and truncation.

Raw section names can include control characters or exceed the capped width, which breaks table alignment. Sanitize and truncate to section_width before padding.

Proposed fix

-    let section_display = found_string.section.as_deref().unwrap_or("");
+    let section_display = found_string.section.as_deref().unwrap_or("");
+    let sanitized_section = sanitize_for_display(section_display);
+    let truncated_section = truncate_string(&sanitized_section, section_width);
@@
-    pad_string(section_display, section_width, Alignment::Left),
+    pad_string(&truncated_section, section_width, Alignment::Left),
fn test_plain_unicode_strings() {
    let strings = vec![
        make_string("\u{4E2D}\u{6587}\u{5B57}\u{7B26}\u{4E32}"), // Chinese characters
        make_string("\u{0420}\u{0443}\u{0441}\u{0441}\u{043A}\u{0438}\u{0439}"), // Russian
        make_string("\u{1F600}\u{1F601}\u{1F602}"), // Emojis
    ];
    let result = format_table_with_mode(&strings, &make_metadata(3), false).unwrap();
Fix encoding metadata for Unicode test strings.
make_string uses Encoding::Ascii, which mismatches the Unicode content in this test and can mask encoding handling issues.
Proposed fix
- let strings = vec![
- make_string("\u{4E2D}\u{6587}\u{5B57}\u{7B26}\u{4E32}"), // Chinese characters
- make_string("\u{0420}\u{0443}\u{0441}\u{0441}\u{043A}\u{0438}\u{0439}"), // Russian
- make_string("\u{1F600}\u{1F601}\u{1F602}"), // Emojis
- ];
+ let chinese = "\u{4E2D}\u{6587}\u{5B57}\u{7B26}\u{4E32}";
+ let russian = "\u{0420}\u{0443}\u{0441}\u{0441}\u{043A}\u{0438}\u{0439}";
+ let emoji = "\u{1F600}\u{1F601}\u{1F602}";
+ let strings = vec![
+ FoundString::new(
+ chinese.to_string(),
+ Encoding::Utf8,
+ 0x1000,
+ chinese.len() as u32,
+ StringSource::SectionData,
+ ),
+ FoundString::new(
+ russian.to_string(),
+ Encoding::Utf8,
+ 0x1001,
+ russian.len() as u32,
+ StringSource::SectionData,
+ ),
+ FoundString::new(
+ emoji.to_string(),
+ Encoding::Utf8,
+ 0x1002,
+ emoji.len() as u32,
+ StringSource::SectionData,
+ ),
+ ];🤖 Prompt for AI Agents
In `@tests/output_table_integration.rs` around lines 302 - 308, The test
test_plain_unicode_strings constructs Unicode-containing cells using make_string
but currently uses Encoding::Ascii (via make_string), which is incorrect; update
the calls to make_string to use an appropriate Unicode encoding (e.g.,
Encoding::Utf8) and ensure the associated metadata from make_metadata reflects
UTF-8 encoding so encoding handling is exercised correctly when calling
format_table_with_mode; locate the make_string calls inside
test_plain_unicode_strings and change the encoding argument and/or the metadata
returned by make_metadata to UTF-8/Unicode.
This pull request primarily updates documentation and adds new output formatting capabilities. The most significant change is the introduction of a new JSON output module, along with the re-export of output formatting types. Additionally, several documentation examples are updated for clarity and consistency, particularly regarding string extraction and construction.
Output formatting improvements:
- New `format_json` function in `src/output/json.rs` for formatting extracted strings as JSONL, including comprehensive tests for serialization, escaping, and error handling.
- Re-exported output formatting types (`OutputFormat`, `OutputMetadata`, `format_json`, `format_output`, `format_table_with_mode`, `format_yara`) in `src/lib.rs` for easier access by library users.

Documentation updates and consistency improvements:

- Updated documentation examples in `src/classification/mod.rs`, `src/classification/semantic.rs`, and `src/classification/symbols.rs` to use the `FoundString::new` constructor instead of manual struct initialization, improving clarity and reducing boilerplate. [1] [2] [3] [4]
- Updated documentation for `extract_utf16_strings` (formerly `extract_utf16le_strings`) and improved example code for section-aware extraction and PE resource extraction, including proper function signatures and error handling. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

These changes improve the usability of the library for both developers and end users by providing better output options and clearer documentation.
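A minimal sketch of how a downstream consumer might drive the re-exported formatting API; the `OutputMetadata::new` arguments mirror those used in the tests above, while the crate name and `format_output`'s exact return type are assumptions:

```rust
// Crate name assumed from the repository name; adjust to the published name.
use stringy::{format_output, FoundString, OutputFormat, OutputMetadata};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Normally produced by the extraction pipeline; empty here for brevity.
    let strings: Vec<FoundString> = Vec::new();
    // Dispatch on OutputFormat::Json to get JSONL output.
    let metadata = OutputMetadata::new("sample.bin".to_string(), OutputFormat::Json, 0, 0);
    let output = format_output(&strings, &metadata)?;
    println!("{output}");
    Ok(())
}
```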