Implement Semantic Pattern Matching: GUID, Email, Base64, Printf Format, and User Agent Detection

## Overview

Implement semantic pattern matching for the remaining high-value string classifications: GUID, email addresses, Base64 data, printf-style format strings, and user agent strings. These patterns are critical for identifying security indicators and code artifacts in binary analysis.

## Current State

- **Tag Enum**: Already defined in `src/types.rs` with `Guid`, `Email`, `Base64`, `FormatString`, and `UserAgent` variants
- **Classification Module**: Empty stub at `src/classification/mod.rs` (contains only a comment)
- **Documentation**: Comprehensive patterns and implementation examples exist in `docs/src/classification.md`
- **Dependency Gap**: Missing `regex` crate for pattern matching
- **Blocker**: Depends on Semantic Classification Framework implementation

## Technical Requirements

### Dependencies to Add

Add to `Cargo.toml`:
```toml
[dependencies]
regex = "1.10"
lazy_static = "1.4"  # For regex compilation caching
```

### Implementation Details

Create `SemanticClassifier` struct in `src/classification/mod.rs`:

```rust
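use regex::Regex;
use crate::types::{StringContext, Tag};

/// Holds one compiled regex per semantic pattern.
pub struct SemanticClassifier {
    guid_regex: Regex,
    email_regex: Regex,
    base64_regex: Regex,
    format_regex: Regex,
    user_agent_regex: Regex,
}

impl SemanticClassifier {
    /// Compiles each pattern once per instance; keep a single instance in
    /// `lazy_static!` to compile once per process. The patterns are fixed
    /// literals, so a compile failure is a programming error.
    pub fn new() -> Self {
        let compile = |p: &str| Regex::new(p).expect("hard-coded regex must compile");
        Self {
            guid_regex: compile(r"\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}"),
            email_regex: compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
            base64_regex: compile(r"[A-Za-z0-9+/]{20,}={0,2}"),
            format_regex: compile(r"%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}"),
            user_agent_regex: compile(r"Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+"),
        }
    }

    /// Returns every tag whose pattern matches `text`. `_context` is unused
    /// until context-aware scoring is added. Assumes the `Tag` variants
    /// listed under Current State are unit variants.
    pub fn classify(&self, text: &str, _context: &StringContext) -> Vec<Tag> {
        [
            (&self.guid_regex, Tag::Guid),
            (&self.email_regex, Tag::Email),
            (&self.base64_regex, Tag::Base64),
            (&self.format_regex, Tag::FormatString),
            (&self.user_agent_regex, Tag::UserAgent),
        ]
        .into_iter()
        .filter(|(regex, _)| regex.is_match(text))
        .map(|(_, tag)| tag)
        .collect()
    }
}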
```

### Regex Patterns

Based on `docs/src/classification.md`:

1. **GUID/UUID**
   - Pattern: `\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}`
   - Example: `{12345678-1234-1234-1234-123456789abc}`
   - Validation: Format compliance, version field checking (the pattern as written matches only brace-wrapped GUIDs, not bare UUIDs)

2. **Email Address**
   - Pattern: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
   - Example: `admin@malware.com`
   - Validation: RFC compliance, domain validation

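3. **Base64**
   - Pattern: `[A-Za-z0-9+/]{20,}={0,2}`
   - Example: `VGhlIHF1aWNrIGJyb3duIGZveA==` (the shorter `SGVsbG8gV29ybGQ=` is valid Base64 but falls below the 20-character minimum, so the pattern does not match it)
   - Validation: Length divisibility by 4, padding correctness, minimum length threshold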

4. **Printf Format Strings**
   - Pattern: `%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}`
   - Examples: `Error: %s at line %d`, `User {0} logged in`
   - Context: Proximity to other format strings, common in `.rodata`

5. **User Agent**
   - Pattern: `Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+`
   - Example: `Mozilla/5.0 (Windows NT 10.0; Win64; x64)`
   - Validation: Known browser identifiers, version format
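
The validation bullets for Base64 and GUID describe checks that a regex alone does not enforce. The following is a minimal std-only sketch of two such checks, intended to run after a regex hit. The function names `is_plausible_base64` and `guid_version` and the `min_len` parameter are hypothetical and do not come from `docs/src/classification.md`.

```rust
/// Hypothetical helper: structural Base64 check applied after the regex matches.
/// Verifies minimum length, length divisibility by 4, padding placement,
/// and that every non-padding byte belongs to the standard alphabet.
pub fn is_plausible_base64(text: &str, min_len: usize) -> bool {
    let bytes = text.as_bytes();
    if bytes.len() < min_len || bytes.len() % 4 != 0 {
        return false;
    }
    // Padding may only appear as the final one or two characters.
    let pad = bytes.iter().rev().take_while(|&&b| b == b'=').count();
    if pad > 2 {
        return false;
    }
    bytes[..bytes.len() - pad]
        .iter()
        .all(|&b| b.is_ascii_alphanumeric() || b == b'+' || b == b'/')
}

/// Hypothetical helper: returns the version digit of a brace-wrapped GUID,
/// or `None` when the text is not in the `{8-4-4-4-12}` hex layout.
pub fn guid_version(text: &str) -> Option<u32> {
    let inner = text.strip_prefix('{')?.strip_suffix('}')?;
    let groups: Vec<&str> = inner.split('-').collect();
    let expected = [8, 4, 4, 4, 12];
    if groups.len() != expected.len() {
        return None;
    }
    let well_formed = groups
        .iter()
        .zip(expected)
        .all(|(g, len)| g.len() == len && g.bytes().all(|b| b.is_ascii_hexdigit()));
    if !well_formed {
        return None;
    }
    // The version is the first hex digit of the third group.
    groups[2].chars().next()?.to_digit(16)
}
```

Whether an unusual version digit should lower confidence or reject the match outright is left open here; GUIDs embedded in binaries do not always follow the UUID version scheme.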

## Acceptance Criteria

- [ ] `SemanticClassifier` struct implemented with all five pattern types
- [ ] Each pattern has dedicated detection method with confidence scoring
- [ ] Context-aware classification considers section type and binary format
- [ ] False positive reduction through validation (length, entropy, format)
- [ ] Regex patterns compiled once and cached using `lazy_static`
- [ ] Integration with `FoundString` type to populate `tags` field
- [ ] Comprehensive unit tests for each pattern type:
  - [ ] Positive matches (valid GUIDs, emails, Base64, format strings, user agents)
  - [ ] Negative matches (invalid formats, edge cases)
  - [ ] Context-dependent behavior
  - [ ] Multi-tag scenarios (string matching multiple patterns)
- [ ] Integration tests with real binary samples
- [ ] Documentation with examples and pattern explanations
- [ ] Benchmark tests for performance validation
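
The criteria above mention confidence scoring and entropy-based false positive reduction without fixing a formula. As one possible sketch, the snippet below computes Shannon entropy over bytes and folds it into a score for a Base64 hit. The name `base64_confidence`, the weights, and the 3.5-bit threshold are all illustrative assumptions, not values taken from the project.

```rust
use std::collections::HashMap;

/// Shannon entropy of the byte distribution, in bits per byte.
/// Returns 0.0 for empty input.
pub fn shannon_entropy(text: &str) -> f64 {
    if text.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for b in text.bytes() {
        *counts.entry(b).or_insert(0) += 1;
    }
    let total = text.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.log2()
        })
        .sum()
}

/// Hypothetical score in the range 0.0 to 1.0 for a string that already
/// matched the Base64 regex. Weights and threshold are placeholders.
pub fn base64_confidence(text: &str) -> f64 {
    let mut score: f64 = 0.0;
    if text.len() >= 20 {
        score += 0.4;
    }
    if text.len() % 4 == 0 {
        score += 0.3;
    }
    // Encoded data tends to use many distinct symbols; repeated filler does not.
    if shannon_entropy(text) >= 3.5 {
        score += 0.3;
    }
    score
}
```

With these placeholder weights, a run of identical characters that happens to satisfy the regex scores lower than a genuinely encoded payload of the same length, which is the behaviour the false positive criterion asks for.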

## Test Coverage Requirements

### Unit Tests (`tests/classification_tests.rs`)

```rust
#[test]
fn test_guid_detection() {
    // Valid GUID formats
    // Invalid GUID formats
    // Case sensitivity
}

#[test]
fn test_email_detection() {
    // Valid emails
    // Invalid emails (missing @, invalid TLD)
}

#[test]
fn test_base64_detection() {
    // Valid Base64 (with/without padding)
    // Invalid Base64 (wrong length, invalid characters)
    // Minimum length threshold
}

#[test]
fn test_format_string_detection() {
    // Printf-style: %s, %d, %x
    // Python-style: {0}, {1}
    // Mixed format strings
}

#[test]
fn test_user_agent_detection() {
    // Common browsers
    // Mobile user agents
    // Bot user agents
}

#[test]
fn test_false_positive_reduction() {
    // High-entropy binary data
    // Very short matches
    // Invalid context
}
```
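
One way to fill in these placeholders is a table of `(input, expected)` pairs per pattern. The sketch below does this for the email cases. Because the classifier does not exist yet, it tests `looks_like_email`, a hypothetical std-only stand-in that checks the whole string against the same character classes as the email regex; the real test would call the classifier instead.

```rust
/// Hypothetical stand-in for the email matcher. Unlike the regex, which
/// searches for a match anywhere in the text, this checks the whole string.
fn looks_like_email(text: &str) -> bool {
    let Some((local, domain)) = text.split_once('@') else {
        return false;
    };
    let Some((host, tld)) = domain.rsplit_once('.') else {
        return false;
    };
    let local_ok = !local.is_empty()
        && local
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b"._%+-".contains(&b));
    let host_ok = !host.is_empty()
        && host
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'.' || b == b'-');
    let tld_ok = tld.len() >= 2 && tld.bytes().all(|b| b.is_ascii_alphabetic());
    local_ok && host_ok && tld_ok
}

/// Positive and negative cases: valid addresses, missing `@`, invalid TLD.
const EMAIL_CASES: &[(&str, bool)] = &[
    ("admin@malware.com", true),
    ("first.last+tag@sub.example.org", true),
    ("no-at-sign.example.com", false),
    ("user@example.c", false),
    ("user@example.123", false),
    ("@example.com", false),
];

#[test]
fn test_email_detection() {
    for &(input, expected) in EMAIL_CASES {
        assert_eq!(looks_like_email(input), expected, "input: {input}");
    }
}
```

Keeping the cases in a table makes it cheap to add edge cases as false positives turn up in real binaries.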

### Integration Tests

Test against strings extracted from real binaries that contain these patterns:
- ELF binaries with GUIDs in `.rodata`
- PE files with user agents in `.rdata`
- Mach-O binaries with format strings in `__TEXT,__cstring`

## Performance Considerations

- Use `lazy_static` for one-time regex compilation
- Implement short-circuit evaluation (check simpler patterns first)
- Consider minimum string length before applying expensive regex
- Profile regex performance on large binaries
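
The length and short-circuit ideas above can be sketched as a cheap pre-filter that decides which regexes are worth running at all. The `Prefilter` struct and `prefilter` function are hypothetical; the thresholds come from the patterns in this issue (a braced GUID is 38 characters, the shortest string the email pattern can match is 6, and the Base64 pattern needs at least 20).

```rust
/// Hypothetical pre-filter result: one flag per pattern, set when the
/// string could possibly match and the regex is therefore worth running.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Prefilter {
    pub guid: bool,
    pub email: bool,
    pub base64: bool,
    pub format_string: bool,
    pub user_agent: bool,
}

/// Cheap length and byte checks that run before any regex.
pub fn prefilter(text: &str) -> Prefilter {
    // The shortest possible hit is a two-byte specifier such as "%s".
    if text.len() < 2 {
        return Prefilter::default();
    }
    Prefilter {
        guid: text.len() >= 38 && text.contains('{'),
        email: text.len() >= 6 && text.contains('@'),
        base64: text.len() >= 20,
        format_string: text.contains('%') || text.contains('{'),
        user_agent: text.contains('/'),
    }
}
```

A flag set to `true` only means the regex should run, not that the string matches. Whether this pre-filter pays for itself depends on the string mix, so it is worth measuring in the benchmark tests before adopting it.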

## Related Issues

- Blocks: #36 (Main String Extraction Orchestrator)
- Blocks: #37 (Complete Pipeline Integration)
- Related: #33 (Regex Caching)

## References

- Detailed patterns: `docs/src/classification.md`
- Type definitions: `src/types.rs`
- Tag enum: Lines 20-40 in `src/types.rs`

## Implementation Notes

1. Start with `SemanticClassifier` struct definition
2. Implement each pattern matcher as a separate method
3. Add validation logic for each pattern type
4. Integrate with existing `FoundString` and `Tag` types
5. Write comprehensive unit tests for each pattern
6. Add integration tests with real binaries
7. Optimize with benchmarking
8. Update documentation with usage examples

