Merged
66 changes: 66 additions & 0 deletions .goreleaser.yaml
@@ -0,0 +1,66 @@
# This is an example .goreleaser.yml file with some sensible defaults.
# Make sure to check the documentation at https://goreleaser.com

# The lines below are called `modelines`. See `:help modeline`
# Feel free to remove those if you don't want/need to use them.
# yaml-language-server: $schema=https://goreleaser.com/static/schema.json
# vim: set ts=2 sw=2 tw=0 fo=cnqoj

version: 2

before:
  hooks:
    # Ensure cargo-zigbuild is available for cross-compilation
    # Note: rustup toolchain is pinned via rust-toolchain.toml
    - cargo install --locked cargo-zigbuild
    - cargo fetch --locked

builds:
  # macOS targets - use regular cargo (zigbuild has issues with macOS linker flags)
  - builder: rust
    id: darwin
    command: build
    flags:
      - --release
    targets:
      - x86_64-apple-darwin
      - aarch64-apple-darwin

  # Linux/Windows targets - use cargo-zigbuild for cross-compilation
  - builder: rust
    id: cross
    command: zigbuild
    flags:
      - --release
    targets:
      - x86_64-unknown-linux-gnu
      - aarch64-unknown-linux-gnu
      - x86_64-pc-windows-gnu

archives:
  - formats: [tar.gz]
    # this name template makes the OS and Arch compatible with the results of `uname`.
    name_template: >-
      {{ .ProjectName }}_
      {{- title .Os }}_
      {{- if eq .Arch "amd64" }}x86_64
      {{- else if eq .Arch "386" }}i386
      {{- else }}{{ .Arch }}{{ end }}
    # use zip for windows archives
    format_overrides:
      - goos: windows
        formats: [zip]

changelog:
  sort: asc
  filters:
    exclude:
      - "^docs:"
      - "^test:"

release:
  footer: >-

    ---

    Released by [GoReleaser](https://github.com/goreleaser/goreleaser).
18 changes: 16 additions & 2 deletions AGENTS.md
@@ -39,7 +39,19 @@ Use `thiserror` with detailed context. Include offsets, section names, and file

### Public API Structs

Use `#[non_exhaustive]` for public structs and provide explicit constructors.
Use `#[non_exhaustive]` for public structs and provide explicit constructors. When using `#[non_exhaustive]` structs internally, always use the constructor pattern (`Type::new()`) rather than struct literals - struct literals bypass the forward-compatibility guarantee.
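
The constructor rule can be sketched as follows. The type `ScanOptions` and its fields are hypothetical and only illustrate the pattern; they are not taken from this crate. Inside the defining crate the compiler still accepts struct literals for a `#[non_exhaustive]` type, so the discipline has to come from convention.

```rust
// Hypothetical type for illustration; not part of the crate's actual API.
#[non_exhaustive]
#[derive(Debug, Clone, PartialEq)]
pub struct ScanOptions {
    pub min_length: usize,
    pub include_utf16: bool,
}

impl ScanOptions {
    /// Explicit constructor: the one place that names every field.
    /// Adding a field later only requires a new default here.
    pub fn new(min_length: usize) -> Self {
        Self {
            min_length,
            include_utf16: false,
        }
    }

    /// Builder-style setter so callers never need a struct literal.
    pub fn with_utf16(mut self, include_utf16: bool) -> Self {
        self.include_utf16 = include_utf16;
        self
    }
}
```

A call site then reads `ScanOptions::new(4).with_utf16(true)`, and it keeps compiling when a new field with a default is added to `new()`.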

### Test-Only Code

For test utilities that shouldn't be in production builds:

- Add `#[cfg(test)]` to both the struct/type definition AND any impl blocks
- Use `pub(crate)` visibility for internal test helpers
- Keep test infrastructure in `#[cfg(test)] mod tests` blocks within the module
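
A minimal sketch of the three rules above, assuming a hypothetical helper type `SampleBuffer` and a hypothetical production function `count_printable` (neither name comes from this crate):

```rust
/// Production code: counts printable ASCII bytes (0x20..=0x7e) in a buffer.
pub fn count_printable(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| (0x20..=0x7e).contains(&b)).count()
}

// Hypothetical test helper. Both the type and its impl block carry
// #[cfg(test)], so neither is compiled into non-test builds.
#[cfg(test)]
pub(crate) struct SampleBuffer {
    bytes: Vec<u8>,
}

#[cfg(test)]
impl SampleBuffer {
    pub(crate) fn with_text(text: &str) -> Self {
        Self {
            bytes: text.as_bytes().to_vec(),
        }
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn counts_only_printable_bytes() {
        // The newline is not printable, so only 'a' and 'b' are counted.
        let sample = SampleBuffer::with_text("ab\n");
        assert_eq!(count_printable(&sample.bytes), 2);
    }
}
```

If the `impl` block lacked `#[cfg(test)]`, a non-test build would fail because the type it refers to no longer exists.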

### Regex Patterns

Use `lazy_static!` or `once_cell::sync::Lazy` for compiled regexes. Always use `.expect("descriptive message")` instead of `.unwrap()` for regex compilation - invalid regex patterns should fail fast with clear error messages.

## Development Commands

@@ -75,8 +87,10 @@ Import from `stringy::extraction` or `stringy::types`, not deeply nested paths.

## Adding Features

**New semantic tag**: Add variant to `Tag` enum in `types.rs`, implement pattern in `classification/semantic.rs`
**New semantic tag**: Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`

Comment on lines 88 to 91

⚠️ Potential issue | 🟡 Minor

Fix MD036 by using a heading instead of bold text.
Markdownlint reports emphasis used as a heading. Convert the bold label to a heading or list item.

Proposed fix
-**New semantic tag**: Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
+### New semantic tag
+Add variant to `Tag` enum in `types/mod.rs`, implement pattern in `classification/patterns/` or `classification/mod.rs`
🤖 Prompt for AI Agents
In `@AGENTS.md` around lines 88 - 91, Replace the bold label "**New semantic
tag**:" with a proper Markdown heading (e.g., "### New semantic tag") or a list
item to satisfy MD036; keep the rest of the sentence ("Add variant to `Tag` enum
in `types/mod.rs`, implement pattern in `classification/patterns/` or
`classification/mod.rs`") unchanged and ensure the heading level aligns with
surrounding headings in AGENTS.md so references to the Tag enum (`Tag`),
types/mod.rs, and classification/patterns/ or classification/mod.rs remain
clear.

**New section weight**: Add match arm in the relevant `container/*.rs` parser

**New string extractor**: Follow patterns in `extraction/` module

**Splitting large files**: When a file exceeds 500 lines, convert to a module directory: `foo.rs` -> `foo/mod.rs` + `foo/submodule.rs`. Move related code to submodules while keeping public re-exports in `mod.rs`.
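
The split can be sketched in a single file with inline modules. In the real tree each `mod` body would live in its own file (`foo/mod.rs`, `foo/submodule.rs`); the names `foo`, `submodule`, and `helper` are placeholders, not modules from this crate.

```rust
// Single-file sketch of the layout; names are hypothetical.
pub mod foo {
    // Would be the contents of foo/submodule.rs.
    mod submodule {
        /// Example item moved out of the original large file.
        pub fn helper(input: &str) -> usize {
            input.len()
        }
    }

    // foo/mod.rs keeps the public surface stable through a re-export,
    // so callers still write `foo::helper` after the split.
    pub use self::submodule::helper;
}
```

Because `submodule` itself stays private, the re-export in `mod.rs` is the only path callers can use, which keeps existing imports working after the move.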
8 changes: 6 additions & 2 deletions Cargo.toml
@@ -22,15 +22,15 @@ path = "src/main.rs"
[dependencies]
clap = { version = "4.5.54", features = [ "derive" ] }
cpp_demangle = "0.5.1"
entropy = "0.4.2"
entropy = "0.4.3"
goblin = "0.10.4"
once_cell = "1.21.3"
pelite = "0.10.0"
regex = "1.12.2"
rustc-demangle = "0.1.27"
serde = { version = "1.0.228", features = [ "derive" ] }
serde_json = "1.0.149"
thiserror = "2.0.17"
thiserror = "2.0.18"

[dev-dependencies]
criterion = "0.8.1"
@@ -46,6 +46,10 @@ lto = "thin"
name = "elf"
harness = false

[[bench]]
name = "classification"
harness = false

[[bench]]
name = "pe"
harness = false
136 changes: 136 additions & 0 deletions benches/classification.rs
@@ -0,0 +1,136 @@
use criterion::{Criterion, criterion_group, criterion_main};
use std::hint::black_box;
use stringy::classification::SemanticClassifier;
use stringy::types::{BinaryFormat, Encoding, SectionType, StringContext, StringSource};

fn make_context() -> StringContext {
    StringContext::new(
        SectionType::StringData,
        BinaryFormat::Elf,
        Encoding::Ascii,
        StringSource::SectionData,
    )
    .with_section_name(".rodata".to_string())
}

fn bench_classifier_construction(c: &mut Criterion) {
    c.bench_function("classification_classifier_construction", |b| {
        b.iter(|| {
            let _ = SemanticClassifier::new();
        });
    });
}

fn bench_guid_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();
    let guid = "{12345678-1234-1234-1234-123456789abc}";

    c.bench_function("classification_guid", |b| {
        b.iter(|| {
            let _ = classifier.classify(black_box(guid), &context);
        });
    });
}

fn bench_email_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();
    let email = "user.name+tag@example.co.uk";

    c.bench_function("classification_email", |b| {
        b.iter(|| {
            let _ = classifier.classify(black_box(email), &context);
        });
    });
}

fn bench_base64_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();
    let base64 = "U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw==";

    c.bench_function("classification_base64", |b| {
        b.iter(|| {
            let _ = classifier.classify(black_box(base64), &context);
        });
    });
}

fn bench_format_string_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();
    let format_string = "Error: %s at line %d";

    c.bench_function("classification_format_string", |b| {
        b.iter(|| {
            let _ = classifier.classify(black_box(format_string), &context);
        });
    });
}

fn bench_user_agent_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();
    let user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

    c.bench_function("classification_user_agent", |b| {
        b.iter(|| {
            let _ = classifier.classify(black_box(user_agent), &context);
        });
    });
}

fn bench_batch_classification(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();

    let mut samples = Vec::new();
    for index in 0..1000 {
        samples.push(format!("{{12345678-1234-1234-1234-{:012x}}}", index));
        samples.push(format!("user{}@example.com", index));
        samples.push(format!("Error %s at line {}", index));
    }

    c.bench_function("classification_batch", |b| {
        b.iter(|| {
            for sample in &samples {
                let _ = classifier.classify(black_box(sample.as_str()), &context);
            }
        });
    });
}
Comment on lines +84 to +102

⚠️ Potential issue | 🟡 Minor

Batch sample misses a format-string match.

"Error %s at line {}" uses {} which does not match the documented format-string patterns that rely on % specifiers or {digits}. This reduces the intended mix of categories in the batch set.

Proposed fix
-        samples.push(format!("Error %s at line {}", index));
+        samples.push(format!("Error %s at line %d {}", index));
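
To see what each candidate actually puts into the sample vector, the sketch below builds the strings with plain `format!`. The function name `candidate_samples` is hypothetical. Whether the classifier tags any of these strings depends on its patterns, which are not shown in this diff.

```rust
/// Builds the candidate sample strings for a given index.
/// `%` has no special meaning to `format!`; `{}` and `{0}` are both
/// consumed and replaced by `index`.
pub fn candidate_samples(index: usize) -> [String; 3] {
    [
        // Original line in the benchmark.
        format!("Error %s at line {}", index),
        // Variant from the proposed fix above.
        format!("Error %s at line %d {}", index),
        // Explicit positional argument; same output as the first.
        format!("Error %s at line {0}", index),
    ]
}
```

For `index = 7` this yields `Error %s at line 7`, `Error %s at line %d 7`, and `Error %s at line 7`. The `%s` specifier is present in all three, and no literal braces survive in any of them.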
🤖 Prompt for AI Agents
In `@benches/classification.rs` around lines 84 - 102, in
bench_batch_classification the sample built from "Error %s at line {}" should
match the classifier's documented patterns (% specifiers or numeric {digits}).
Rust's format! consumes both {} and {0} and substitutes index, so neither
placeholder survives into the generated string; update the format! call that
populates samples (inside the loop that builds samples) to carry an explicit %
specifier, for example "Error %s at line %d {}" (still passing index), so the
sample vector contains the intended mix of formats for the
classifier.classify(..., &context) calls.


fn bench_worst_case(c: &mut Criterion) {
    let classifier = SemanticClassifier::new();
    let context = make_context();
    let worst_case = "x9qz1p0t8v7w6r5y4u3i2o1p-";

    c.bench_function("classification_worst_case", |b| {
        b.iter(|| {
            let _ = classifier.classify(black_box(worst_case), &context);
        });
    });
}

fn bench_context_creation(c: &mut Criterion) {
    c.bench_function("classification_context_creation", |b| {
        b.iter(|| {
            let _ = make_context();
        });
    });
}

criterion_group!(
    classification_benches,
    bench_classifier_construction,
    bench_guid_classification,
    bench_email_classification,
    bench_base64_classification,
    bench_format_string_classification,
    bench_user_agent_classification,
    bench_batch_classification,
    bench_worst_case,
    bench_context_creation
);
criterion_main!(classification_benches);