Implement Semantic Pattern Matching: GUID, Email, Base64, Printf Format, and User Agent Detection #18

@unclesp1d3r

Overview

Implement semantic pattern matching for the remaining high-value string classifications: GUID, email addresses, Base64 data, printf-style format strings, and user agent strings. These patterns are critical for identifying security indicators and code artifacts in binary analysis.

Current State

  • Tag Enum: Already defined in src/types.rs with Guid, Email, Base64, FormatString, and UserAgent variants
  • Classification Module: Empty stub at src/classification/mod.rs (contains only a comment)
  • Documentation: Comprehensive patterns and implementation examples exist in docs/src/classification.md
  • Dependency Gap: Missing regex crate for pattern matching
  • Blocker: Depends on Semantic Classification Framework implementation

Technical Requirements

Dependencies to Add

Add to Cargo.toml:

[dependencies]
regex = "1.10"
lazy_static = "1.4"  # For regex compilation caching

Implementation Details

Create SemanticClassifier struct in src/classification/mod.rs:

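use regex::Regex;
use lazy_static::lazy_static;
use crate::types::{Tag, FoundString, StringContext};

/// Tags strings with semantic classifications (GUID, email, Base64,
/// format string, user agent) based on regex matches.
pub struct SemanticClassifier {
    guid_regex: Regex,
    email_regex: Regex,
    base64_regex: Regex,
    format_regex: Regex,
    user_agent_regex: Regex,
}

impl SemanticClassifier {
    /// Builds a classifier, compiling each regex pattern once.
    pub fn new() -> Self {
        // Initialize with compiled regex patterns
        todo!("compile the five patterns listed under Regex Patterns")
    }

    /// Returns every tag whose pattern matches `text`. `context` is meant
    /// to supply the section type and binary format used to refine the result.
    pub fn classify(&self, text: &str, context: &StringContext) -> Vec<Tag> {
        // Pattern matching logic
        todo!("run each pattern matcher and collect the matching tags")
    }
}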
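
One possible shape for the tag-collecting part of classify is a table of detectors evaluated in order, so a single string can end up with several tags. The sketch below is dependency-free and makes several assumptions: Tag is a local stand-in that mirrors the variant names listed under Current State, plain function pointers take the place of the Regex fields, has_format_specifier is a hand-written equivalent of the printf pattern from the Regex Patterns section, and has_mozilla_token is a toy stand-in that checks a single product token. None of these helper names exist in the codebase.

```rust
/// Local stand-in for `crate::types::Tag`, limited to the five variants
/// this issue names (assumption: the real enum has more).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tag {
    Guid,
    Email,
    Base64,
    FormatString,
    UserAgent,
}

/// A detector answers "does this string contain the pattern?".
pub type Detector = fn(&str) -> bool;

/// Hand-written equivalent of `%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}`.
pub fn has_format_specifier(s: &str) -> bool {
    let b = s.as_bytes();
    (0..b.len()).any(|i| {
        let rest = &b[i + 1..];
        // Number of ASCII digits immediately following position `i`.
        let digits = rest.iter().take_while(|c| c.is_ascii_digit()).count();
        match b[i] {
            // `%`, optional width digits, then a conversion character.
            b'%' => rest.get(digits).map_or(false, |c| b"sdxofcpn".contains(c)),
            // `{`, at least one digit, then `}`.
            b'{' => digits > 0 && rest.get(digits) == Some(&b'}'),
            _ => false,
        }
    })
}

/// Toy stand-in for the user agent pattern: checks one product token only.
pub fn has_mozilla_token(s: &str) -> bool {
    s.contains("Mozilla/")
}

/// Runs every detector and returns the tags whose detector matched,
/// in table order.
pub fn classify(text: &str, detectors: &[(Tag, Detector)]) -> Vec<Tag> {
    detectors
        .iter()
        .filter(|(_, detect)| detect(text))
        .map(|(tag, _)| *tag)
        .collect()
}
```

With a table holding (Tag::FormatString, has_format_specifier) and (Tag::UserAgent, has_mozilla_token), the string "Mozilla/5.0 %s" receives both tags, while "100% done" receives none because no conversion character follows the percent sign.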

Regex Patterns

Based on docs/src/classification.md:

  1. GUID/UUID

    • Pattern: \{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}
    • Example: {12345678-1234-1234-1234-123456789abc}
    • Validation: Format compliance, version field checking
  2. Email Address

    • Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
    • Example: admin@malware.com
    • Validation: RFC compliance, domain validation
  3. Base64

    • Pattern: [A-Za-z0-9+/]{20,}={0,2}
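    • Example: VGhpcyBpcyBhIHRlc3Qgc3RyaW5n (shorter strings such as SGVsbG8gV29ybGQ= fall below the 20-character minimum and do not match)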
    • Validation: Length divisibility by 4, padding correctness, minimum length threshold
  4. Printf Format Strings

    • Pattern: %[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}
    • Examples: Error: %s at line %d, User {0} logged in
    • Context: Proximity to other format strings, common in .rodata
  5. User Agent

    • Pattern: Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+
    • Example: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
    • Validation: Known browser identifiers, version format
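
The Base64 validation steps listed above (minimum length, length divisibility by 4, padding correctness) can be sketched without any dependencies. This is a hedged illustration, not the project's implementation: looks_like_base64 and MIN_BASE64_BODY are hypothetical names, the check covers the whole string rather than a substring, and it assumes padded standard Base64, so unpadded input whose length is not a multiple of 4 is rejected.

```rust
/// Minimum number of Base64 alphabet characters, mirroring the `{20,}`
/// quantifier in the Base64 pattern. (Hypothetical constant.)
const MIN_BASE64_BODY: usize = 20;

/// Returns true if `s` is structurally plausible padded Base64:
/// alphabet characters followed by at most two `=` characters, with a
/// total length divisible by 4. (Hypothetical helper.)
pub fn looks_like_base64(s: &str) -> bool {
    // Split the candidate into its body and trailing padding.
    let body = s.trim_end_matches('=');
    let padding = s.len() - body.len();

    // Padding may only be 0, 1, or 2 characters long.
    if padding > 2 {
        return false;
    }
    // Enforce the minimum length threshold on the body.
    if body.len() < MIN_BASE64_BODY {
        return false;
    }
    // Padded Base64 always has a total length that is a multiple of 4.
    if s.len() % 4 != 0 {
        return false;
    }
    // Every body byte must come from the standard alphabet; this also
    // rejects `=` appearing anywhere except the end.
    body.bytes()
        .all(|b| b.is_ascii_alphanumeric() || b == b'+' || b == b'/')
}
```

Running a check like this after the regex match is one way to drop candidates that fit the character class but cannot be decoded, such as strings with three trailing = characters.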

Acceptance Criteria

  • SemanticClassifier struct implemented with all five pattern types
  • Each pattern has a dedicated detection method with confidence scoring
  • Context-aware classification considers section type and binary format
  • False positive reduction through validation (length, entropy, format)
  • Regex patterns compiled once and cached using lazy_static
  • Integration with FoundString type to populate tags field
  • Comprehensive unit tests for each pattern type:
    • Positive matches (valid GUIDs, emails, Base64, format strings, user agents)
    • Negative matches (invalid formats, edge cases)
    • Context-dependent behavior
    • Multi-tag scenarios (string matching multiple patterns)
  • Integration tests with real binary samples
  • Documentation with examples and pattern explanations
  • Benchmark tests for performance validation
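
For the entropy part of false positive reduction, a Shannon entropy helper over the string's bytes is a common building block. The sketch below is an assumption about how this could look: shannon_entropy and passes_entropy_gate are hypothetical names, and the thresholds are left to the caller because the issue does not specify any.

```rust
/// Shannon entropy of `data` in bits per byte (0.0 to 8.0).
pub fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    // Count how often each byte value occurs.
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f64;
    // Sum -p * log2(p) over the byte values that actually occur.
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// Hypothetical gate: keeps a candidate only if its entropy lies inside
/// the caller-supplied range (inclusive on both ends).
pub fn passes_entropy_gate(text: &str, min_bits: f64, max_bits: f64) -> bool {
    let h = shannon_entropy(text.as_bytes());
    h >= min_bits && h <= max_bits
}
```

A run of 24 identical characters satisfies the Base64 character class but has zero entropy, so a gate with any positive lower bound would reject it. Suitable bounds per pattern type would need to be tuned against real samples.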

Test Coverage Requirements

Unit Tests (tests/classification_tests.rs)

#[test]
fn test_guid_detection() {
    // Valid GUID formats
    // Invalid GUID formats
    // Case sensitivity
}

#[test]
fn test_email_detection() {
    // Valid emails
    // Invalid emails (missing @, invalid TLD)
}

#[test]
fn test_base64_detection() {
    // Valid Base64 (with/without padding)
    // Invalid Base64 (wrong length, invalid characters)
    // Minimum length threshold
}

#[test]
fn test_format_string_detection() {
    // Printf-style: %s, %d, %x
    // Python-style: {0}, {1}
    // Mixed format strings
}

#[test]
fn test_user_agent_detection() {
    // Common browsers
    // Mobile user agents
    // Bot user agents
}

#[test]
fn test_false_positive_reduction() {
    // High-entropy binary data
    // Very short matches
    // Invalid context
}
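
As an illustration of how one of these stubs might be filled in, the sketch below fleshes out test_guid_detection. To stay dependency-free it tests a hand-written whole-string matcher, is_braced_guid, that stands in for the GUID regex; the real test would presumably call the classifier instead. Both is_braced_guid and guid_version are hypothetical helpers. guid_version reads the version nibble, which is the first hex digit of the third group in the textual form.

```rust
/// Hand-written stand-in for the GUID pattern: accepts exactly
/// `{8-4-4-4-12}` hex digits in either case. (Hypothetical helper.)
pub fn is_braced_guid(s: &str) -> bool {
    // Strip the surrounding braces required by the pattern.
    let Some(inner) = s.strip_prefix('{').and_then(|r| r.strip_suffix('}')) else {
        return false;
    };
    // Check the five hyphen-separated groups against the expected lengths.
    let groups: Vec<&str> = inner.split('-').collect();
    let expected = [8, 4, 4, 4, 12];
    groups.len() == expected.len()
        && groups
            .iter()
            .zip(expected)
            .all(|(g, n)| g.len() == n && g.bytes().all(|b| b.is_ascii_hexdigit()))
}

/// Returns the version nibble of a braced GUID, or None if `s` is not one.
pub fn guid_version(s: &str) -> Option<u32> {
    if !is_braced_guid(s) {
        return None;
    }
    // `{` + 8 hex + `-` + 4 hex + `-` puts the third group at byte 15.
    (s.as_bytes()[15] as char).to_digit(16)
}

#[test]
fn test_guid_detection() {
    // Valid GUID formats
    assert!(is_braced_guid("{12345678-1234-1234-1234-123456789abc}"));
    // Case sensitivity: upper-case hex digits are accepted as well
    assert!(is_braced_guid("{12345678-1234-1234-1234-123456789ABC}"));
    // Invalid GUID formats: missing braces, non-hex digit, short group
    assert!(!is_braced_guid("12345678-1234-1234-1234-123456789abc"));
    assert!(!is_braced_guid("{12345678-1234-1234-1234-123456789abz}"));
    assert!(!is_braced_guid("{1234567-1234-1234-1234-123456789abc}"));
    // Version field
    assert_eq!(guid_version("{12345678-1234-4234-1234-123456789abc}"), Some(4));
}
```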

Integration Tests

Test against strings extracted from real binaries that contain these patterns:

  • ELF binaries with GUIDs in .rodata
  • PE files with user agents in .rdata
  • Mach-O binaries with format strings in __TEXT,__cstring

Performance Considerations

  • Use lazy_static for one-time regex compilation
  • Implement short-circuit evaluation (check simpler patterns first)
  • Consider minimum string length before applying expensive regex
  • Profile regex performance on large binaries
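
The short-circuit and minimum-length ideas above could take the form of a cheap prefilter that runs before any regex. The sketch below derives necessary conditions directly from the five patterns in this issue (for example, a braced GUID is exactly 38 bytes and every email match contains @). Candidates and prefilter are hypothetical names, and a true flag only means the regex is still worth running, not that it will match.

```rust
/// Pattern types a string is still a candidate for after cheap checks.
/// (Hypothetical type.)
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Candidates {
    pub guid: bool,
    pub email: bool,
    pub base64: bool,
    pub format_string: bool,
    pub user_agent: bool,
}

/// Cheap necessary conditions derived from the patterns in this issue.
/// A `false` flag means the corresponding regex cannot match and can be
/// skipped entirely.
pub fn prefilter(text: &str) -> Candidates {
    let len = text.len();
    Candidates {
        // A braced GUID match is exactly 38 bytes long and starts with `{`.
        guid: len >= 38 && text.contains('{'),
        // The shortest possible email match is `a@b.cd` (6 bytes).
        email: len >= 6 && text.contains('@'),
        // The Base64 pattern needs at least 20 alphabet characters.
        base64: len >= 20,
        // Every format alternative starts with `%` or `{`.
        format_string: text.contains('%') || text.contains('{'),
        // Shortest user agent match is a 7-byte token such as `Chrome/`
        // plus one version character.
        user_agent: len >= 8 && text.contains('/'),
    }
}
```

Since most strings in a binary are short and contain none of these marker bytes, a filter like this may let the classifier skip all five regexes for the bulk of the input. Whether that is a net win should be confirmed with the planned benchmarks.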

Related Issues

References

  • Detailed patterns: docs/src/classification.md
  • Type definitions: src/types.rs
  • Tag enum: Lines 20-40 in src/types.rs

Implementation Notes

  1. Start with SemanticClassifier struct definition
  2. Implement each pattern matcher as a separate method
  3. Add validation logic for each pattern type
  4. Integrate with existing FoundString and Tag types
  5. Write comprehensive unit tests for each pattern
  6. Add integration tests with real binaries
  7. Optimize with benchmarking
  8. Update documentation with usage examples
