Overview
Implement semantic pattern matching for the remaining high-value string classifications: GUIDs, email addresses, Base64 data, printf-style format strings, and user agent strings. These patterns are critical for identifying security indicators and code artifacts in binary analysis.
Current State
- Tag Enum: Already defined in src/types.rs with Guid, Email, Base64, FormatString, and UserAgent variants
- Classification Module: Empty stub at src/classification/mod.rs (only contains a comment)
- Documentation: Comprehensive patterns and implementation examples exist in docs/src/classification.md
- Dependency Gap: Missing regex crate for pattern matching
- Blocker: Depends on the Semantic Classification Framework implementation
Technical Requirements
Dependencies to Add
Add to Cargo.toml:
[dependencies]
regex = "1.10"
lazy_static = "1.4" # For regex compilation cachingImplementation Details
Create a SemanticClassifier struct in src/classification/mod.rs:
use regex::Regex;
use lazy_static::lazy_static;
use crate::types::{Tag, FoundString, StringContext};

pub struct SemanticClassifier {
    guid_regex: Regex,
    email_regex: Regex,
    base64_regex: Regex,
    format_regex: Regex,
    user_agent_regex: Regex,
}

impl SemanticClassifier {
    pub fn new() -> Self {
        // Initialize with compiled regex patterns (see "Regex Patterns" below)
        todo!()
    }

    pub fn classify(&self, text: &str, context: &StringContext) -> Vec<Tag> {
        // Pattern matching logic with per-pattern validation and context checks
        todo!()
    }
}
Regex Patterns
Based on docs/src/classification.md:
- GUID/UUID
  - Pattern: \{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}
  - Example: {12345678-1234-1234-1234-123456789abc}
  - Validation: Format compliance, version field checking
- Email Address
  - Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
  - Example: admin@malware.com
  - Validation: RFC compliance, domain validation
- Base64
  - Pattern: [A-Za-z0-9+/]{20,}={0,2}
  - Example: SGVsbG8gV29ybGQ=
  - Validation: Length divisibility by 4, padding correctness, minimum length threshold
- Printf Format Strings
  - Pattern: %[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}
  - Examples: Error: %s at line %d, User {0} logged in
  - Context: Proximity to other format strings, common in .rodata
- User Agent
  - Pattern: Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+
  - Example: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
  - Validation: Known browser identifiers, version format
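A minimal sketch of how these five patterns could be compiled once and turned into tags, using the regex and lazy_static crates and the existing Tag variants; confidence scoring, StringContext-aware checks, and validation beyond the Base64 length rule are deliberately omitted, and match_semantic_patterns is an illustrative free function rather than the final SemanticClassifier method:

use lazy_static::lazy_static;
use regex::Regex;
use crate::types::Tag;

lazy_static! {
    // Compiled on first use and cached for the lifetime of the process.
    static ref GUID_RE: Regex = Regex::new(
        r"\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}"
    ).unwrap();
    static ref EMAIL_RE: Regex =
        Regex::new(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}").unwrap();
    static ref BASE64_RE: Regex = Regex::new(r"[A-Za-z0-9+/]{20,}={0,2}").unwrap();
    static ref FORMAT_RE: Regex = Regex::new(r"%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}").unwrap();
    static ref UA_RE: Regex =
        Regex::new(r"Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+").unwrap();
}

/// Illustrative free function; the real logic would live in
/// SemanticClassifier::classify and also consult StringContext.
fn match_semantic_patterns(text: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    if GUID_RE.is_match(text) {
        tags.push(Tag::Guid);
    }
    if EMAIL_RE.is_match(text) {
        tags.push(Tag::Email);
    }
    // The regex finds a Base64 candidate run; the length check enforces the
    // "divisible by 4" validation rule from the list above.
    if let Some(m) = BASE64_RE.find(text) {
        if m.as_str().len() % 4 == 0 {
            tags.push(Tag::Base64);
        }
    }
    if FORMAT_RE.is_match(text) {
        tags.push(Tag::FormatString);
    }
    if UA_RE.is_match(text) {
        tags.push(Tag::UserAgent);
    }
    tags
}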
Acceptance Criteria
- SemanticClassifier struct implemented with all five pattern types
- Each pattern has a dedicated detection method with confidence scoring
- Context-aware classification considers section type and binary format
- False positive reduction through validation (length, entropy, format)
- Regex patterns compiled once and cached using lazy_static
- Integration with the FoundString type to populate the tags field
- Comprehensive unit tests for each pattern type:
  - Positive matches (valid GUIDs, emails, Base64, format strings, user agents)
  - Negative matches (invalid formats, edge cases)
  - Context-dependent behavior
  - Multi-tag scenarios (a string matching multiple patterns)
- Integration tests with real binary samples
- Documentation with examples and pattern explanations
- Benchmark tests for performance validation
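To illustrate the FoundString integration and the multi-tag criterion, a small sketch; the text, context, and tags field names on FoundString and the PartialEq bound on Tag are assumptions to adjust against the real definitions in src/types.rs:

use crate::classification::SemanticClassifier;
use crate::types::{FoundString, Tag};

/// Populate the tags field of an already-extracted string.
/// Field names (text, context, tags) are assumptions for this sketch.
fn tag_found_string(classifier: &SemanticClassifier, found: &mut FoundString) {
    let new_tags: Vec<Tag> = classifier.classify(&found.text, &found.context);
    for tag in new_tags {
        // A single string may match several patterns (e.g. a format string
        // that also embeds an email address), so tags accumulate.
        if !found.tags.contains(&tag) {
            found.tags.push(tag);
        }
    }
}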
Test Coverage Requirements
Unit Tests (tests/classification_tests.rs)
#[test]
fn test_guid_detection() {
// Valid GUID formats
// Invalid GUID formats
// Case sensitivity
}
#[test]
fn test_email_detection() {
// Valid emails
// Invalid emails (missing @, invalid TLD)
}
#[test]
fn test_base64_detection() {
// Valid Base64 (with/without padding)
// Invalid Base64 (wrong length, invalid characters)
// Minimum length threshold
}
#[test]
fn test_format_string_detection() {
// Printf-style: %s, %d, %x
// Python-style: {0}, {1}
// Mixed format strings
}
#[test]
fn test_user_agent_detection() {
// Common browsers
// Mobile user agents
// Bot user agents
}
#[test]
fn test_false_positive_reduction() {
// High-entropy binary data
// Very short matches
// Invalid context
}
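As a concrete starting point, the GUID case might be written as below, assuming the classify API sketched earlier, a StringContext::default() constructor, and Tag: PartialEq; in tests/classification_tests.rs the crate name replaces crate in the imports:

use crate::classification::SemanticClassifier;
use crate::types::{StringContext, Tag};

#[test]
fn test_guid_detection_valid_and_invalid() {
    // StringContext::default() is an assumption; build a real context from
    // the extraction stage if Default is not implemented.
    let classifier = SemanticClassifier::new();
    let ctx = StringContext::default();

    // Valid GUID in the documented brace format.
    let tags = classifier.classify("{12345678-1234-1234-1234-123456789abc}", &ctx);
    assert!(tags.contains(&Tag::Guid));

    // Wrong group lengths and missing braces must not be tagged.
    let tags = classifier.classify("12345678-1234-1234", &ctx);
    assert!(!tags.contains(&Tag::Guid));
}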
Integration Tests
Test with real binaries containing these patterns, for example:
- ELF binaries with GUIDs in .rodata
- PE files with user agents in .rdata
- Mach-O binaries with format strings in __TEXT,__cstring
Performance Considerations
- Use lazy_static for one-time regex compilation
- Implement short-circuit evaluation (check simpler patterns first)
- Consider minimum string length before applying expensive regex
- Profile regex performance on large binaries
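A sketch of a cheap pre-filter that applies the length and short-circuit ideas above before any regex runs; the minimum length and anchor characters are placeholders to tune:

/// Returns false for strings that none of the five patterns can match,
/// so the expensive regex evaluation can be skipped entirely.
fn worth_classifying(text: &str) -> bool {
    const MIN_LEN: usize = 4; // placeholder minimum length

    if text.len() < MIN_LEN {
        return false;
    }
    // Every pattern needs one of these anchor characters or a long run of
    // Base64-alphabet characters, so their absence rules the string out.
    text.contains('@')          // email
        || text.contains('%')   // printf-style format
        || text.contains('{')   // GUID braces, {0}-style format
        || text.contains('/')   // user agent tokens, Base64 alphabet
        || text.len() >= 20     // unpadded Base64 candidates
}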
Related Issues
- Blocks: Implement Main String Extraction Orchestrator and Pipeline Integration #36 (Main String Extraction Orchestrator)
- Blocks: Complete End-to-End Pipeline Integration with Error Recovery and Testing #37 (Complete Pipeline Integration)
- Related: Performance: Implement Regex Compilation Caching for Semantic Classifier #33 (Regex Caching)
References
- Detailed patterns: docs/src/classification.md
- Type definitions: src/types.rs
- Tag enum: lines 20-40 in src/types.rs
Implementation Notes
- Start with the SemanticClassifier struct definition
- Implement each pattern matcher as a separate method
- Add validation logic for each pattern type
- Integrate with the existing FoundString and Tag types
- Write comprehensive unit tests for each pattern
- Add integration tests with real binaries
- Optimize with benchmarking
- Update documentation with usage examples