
Conversation

@PrzemyslawKlys
Member

No description provided.

* Introduced a new test method `PsScript_Populates_ScriptLanguage_And_Cmdlets` to validate PowerShell script detection.
* Enhanced the `FileInspector` analysis logic to map script languages from file extensions and improve cmdlet extraction for PowerShell scripts.
* Updated `SecurityHeuristics` to limit the number of cmdlets extracted and added helper methods for cmdlet identification.
* Modified `FileInspectorX.psd1` to include additional metadata for the module.
…abilities

* Introduced `FileSystemMetadata` and `FileMetadataOptions` classes to encapsulate file metadata details.
* Implemented `ReadFileMetadata` method to read and return metadata for a specified file path.
* Added `InspectWithMetadata` method to combine file inspection results with metadata.
* Enhanced `CollectMetadata` methods to include file system metadata in analysis results.
* Introduced `SignatureStatus` class to summarize signature presence and validity.
* Added support for extracting top tokens from script/log content with configurable options.
* Updated settings to control top token extraction behavior.
* Enhanced analysis and report views to include top tokens in the output.
Copilot AI review requested due to automatic review settings December 26, 2025 19:57
@claude

claude bot commented Dec 26, 2025

PR Review: Move Tier Bridge

I've completed a thorough review of this pull request. Overall, this is a well-structured feature addition that introduces metadata extraction capabilities and enhanced text metrics. Here's my detailed feedback:


Strengths

  1. Well-organized code structure - The use of partial classes (FileInspector.Metadata.cs, FileInspector.TextMetrics.cs) keeps the codebase modular and maintainable.

  2. Comprehensive documentation - All public APIs have clear XML documentation explaining purpose, parameters, and return values.

  3. Good test coverage - The new test PsScript_Populates_ScriptLanguage_And_Cmdlets validates the core functionality of script detection and cmdlet extraction.

  4. Defensive programming - Extensive use of try-catch blocks, null checks, and safe fallbacks (e.g., FileInspector.Metadata.cs:35-38).

  5. Performance considerations - Smart use of ReadOnlySpan<char> for string parsing (FileInspector.TextMetrics.cs:58) and bounded operations with configurable limits.


🐛 Potential Bugs & Issues

1. BOM Handling Issue in ReadHeadText (FileInspector.Analyze.cs:2094-2106)

Severity: Medium

The BOM detection logic has an off-by-one error:

if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
    return System.Text.Encoding.UTF8.GetString(buf, 3, n - 3);

Issue: When the buffer is exactly 3 bytes and contains only the UTF-8 BOM, n - 3 = 0 and an empty string is returned, which is technically correct. However, the other BOM branches could throw ArgumentOutOfRangeException when the buffer length equals the BOM length.

Recommendation: Add bounds checking:

if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
{
    if (n > 3) return System.Text.Encoding.UTF8.GetString(buf, 3, n - 3);
    return string.Empty;
}

2. Cmdlet Extraction Only for PowerShell (FileInspector.Analyze.cs:651-656)

Severity: Low

The code restricts cmdlet extraction to PowerShell only, but SecurityHeuristics.GetCmdlets might be called for non-PowerShell scripts if ScriptLanguage is set but TextSubtype isn't.

Observation: Lines 651-656 use an OR condition, which could trigger for edge cases where only one field is set.

Recommendation: Consider using AND condition or ensuring both fields are consistently set earlier in the pipeline.

3. Missing Null Check in TryPopulateTextMetrics (FileInspector.TextMetrics.cs:18-52)

Severity: Low

Line 22 calls InspectHelpers.IsText(det) without null-checking det first. While IsText likely handles null internally, explicit null-checking here would be defensive.

Recommendation: Add explicit null check if not already handled by IsText.


🔒 Security Considerations

1. Token Extraction from Untrusted Content (FileInspector.TextMetrics.cs:54-91)

Severity: Low

The ExtractTopTokens method processes potentially malicious script/log content. While it's bounded (256KB cap at line 44), extremely dense content with many unique tokens could consume significant memory.

Observation:

  • The dictionary at line 57 is unbounded until sorting
  • An attacker could craft input with millions of unique 2-character tokens

Recommendation:

  • Add a hard limit on dictionary size (e.g., max 10,000 unique tokens)
  • Consider adding this to Settings as TopTokensMaxUniqueTokens
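
A minimal sketch of the hard-limit idea from the recommendation above, assuming a hypothetical maxUnique value sourced from a TopTokensMaxUniqueTokens setting (names are illustrative, not the library's current API):

// Sketch only: stop admitting new unique tokens once the cap is reached,
// while still counting tokens that are already tracked.
static Dictionary<string, int> CountTokens(IEnumerable<string> tokens, int maxUnique)
{
    var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
    foreach (var token in tokens)
    {
        if (counts.TryGetValue(token, out var n))
            counts[token] = n + 1;                       // existing token: always counted
        else if (maxUnique <= 0 || counts.Count < maxUnique)
            counts[token] = 1;                           // new token: admitted only under the cap
        // otherwise the new token is dropped, bounding memory at maxUnique entries
    }
    return counts;
}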

2. Script Hint Extraction (SecurityHeuristics.Cmdlets.cs:8-58)

Severity: Low

The method processes up to 400 lines (line 19) with pattern matching. While reasonable, this could be exploited with very long lines.

Observation: No per-line length limit is enforced.

Recommendation: Add a per-line character limit check to prevent processing extremely long lines.
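
A small sketch of that per-line cap, with hypothetical maxLines and maxLineLength limits (for example, a ScriptHintMaxLineLength-style setting):

// Sketch only: cap both the number of lines and the length of each line before pattern matching.
static IEnumerable<string> BoundedLines(IEnumerable<string> lines, int maxLines, int maxLineLength)
{
    int taken = 0;
    foreach (var line in lines)
    {
        if (taken++ >= maxLines) yield break;                    // existing 400-line style cap
        yield return line.Length <= maxLineLength
            ? line
            : line.Substring(0, maxLineLength);                  // truncate instead of scanning a huge line
    }
}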


Performance Considerations

1. Repeated String Allocations (FileInspector.TextMetrics.cs:67-73)

Severity: Low

var token = span.Slice(start, len).ToString();
token = TrimToken(token);
if (token.Length < minLen) continue;
token = token.ToLowerInvariant();

Multiple string allocations occur per token, including a potential discard if TrimToken reduces it below minLen.

Recommendation:

  • Check length before calling ToString()
  • Consider implementing TrimToken to work on ReadOnlySpan<char> to avoid intermediate allocations
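
A minimal span-based sketch combining both points above: trim on the span, reject short tokens before any allocation, and only then materialize a string (helper names are illustrative):

// Sketch only: trim punctuation on the span and check length before calling ToString().
static string? TryGetToken(ReadOnlySpan<char> slice, int minLen)
{
    static bool IsTrim(char c) => c == '-' || c == '_' || c == '.';
    int start = 0, end = slice.Length - 1;
    while (start <= end && IsTrim(slice[start])) start++;
    while (end >= start && IsTrim(slice[end])) end--;
    var trimmed = slice.Slice(start, end - start + 1);
    if (trimmed.Length < minLen) return null;           // rejected with zero allocations
    return trimmed.ToString().ToLowerInvariant();       // allocate only for accepted tokens
}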

2. TextSubtype Assignment Duplication (FileInspector.Analyze.cs:577-625)

Severity: Low

The code has two nearly identical switch statements:

  • Lines 577-593: declared extension mapping
  • Lines 598-617: detected extension mapping

Recommendation: Extract to a shared helper method MapExtensionToTextSubtype(string? ext) to reduce duplication and improve maintainability.
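
A hedged sketch of such a helper using a switch expression; the cases shown are examples and would need to mirror the two existing switches exactly:

// Sketch only: one shared extension-to-subtype mapping used for both declared and detected extensions.
private static string? MapExtensionToTextSubtype(string? ext) => ext?.ToLowerInvariant() switch
{
    "ps1" or "psm1" or "psd1" => "powershell",
    "py"                      => "python",
    "sh" or "bash"            => "shell",
    "js" or "mjs"             => "javascript",
    "json"                    => "json",
    "yaml" or "yml"           => "yaml",
    _                         => null,
};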


🧪 Test Coverage Assessment

Good:

  • ✅ Basic PowerShell script detection test added
  • ✅ Cmdlet extraction validation
  • ✅ ScriptLanguage field validation

Missing:

  • ❌ No tests for the new metadata APIs (ReadFileMetadata, InspectWithMetadata, CollectMetadata)
  • ❌ No tests for SignatureStatus extraction (GetSignatureStatus)
  • ❌ No tests for top tokens feature (TopTokens, ExtractTopTokens)
  • ❌ No tests for BOM handling in ReadHeadText
  • ❌ No tests for edge cases (empty files, files that don't exist, null parameters)
  • ❌ No tests for the new FileInspectionSummary class

Recommendation: Add tests covering:

[Fact]
public void ReadFileMetadata_NonExistentFile_ReturnsExistsFalse()
public void InspectWithMetadata_ValidFile_ReturnsAllComponents()
public void ExtractTopTokens_WithStopWords_FiltersCorrectly()
public void GetSignatureStatus_UnsignedFile_ReturnsNull()
public void ReadHeadText_WithBOM_StripsCorrectly()
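
For example, the metadata bundle test could look roughly like this, assuming a static FileInspector.InspectWithMetadata entry point as in the other suggested tests (the summary's exact shape may differ):

[Fact]
public void InspectWithMetadata_ValidFile_ReturnsAllComponents()
{
    // Sketch only: asserts non-null here; real assertions would check the bundled
    // analysis, report, and metadata parts once their property names are confirmed.
    var path = Path.GetTempFileName();
    try
    {
        File.WriteAllText(path, "Write-Host 'hello'");
        var summary = FileInspector.InspectWithMetadata(path);
        Assert.NotNull(summary);
    }
    finally { File.Delete(path); }
}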

📋 Code Quality & Best Practices

Positives:

  1. ✅ Consistent use of nullable reference types
  2. ✅ Immutable return types (IReadOnlyList, IReadOnlyDictionary)
  3. ✅ Proper use of StringComparer.OrdinalIgnoreCase for case-insensitive operations
  4. ✅ Settings-based feature flags (TopTokensEnabled, line 25 in TextMetrics.cs)

Suggestions:

  1. FileInspector.Metadata.cs:32 - MagicHeaderHex method is called but not shown in diff. Ensure it exists and is tested.
  2. SecurityHeuristics.Cmdlets.cs:66 - The check on char.IsWhiteSpace compares the result with false, which is redundant:
    if (start < span.Length && char.IsWhiteSpace(span[start]) == false)
    Should be:
    if (start < span.Length && !char.IsWhiteSpace(span[start]))

📝 Minor Issues

  1. FileInspector.Analyze.cs:509 - Variable mappedScript could use a more descriptive name like scriptLanguageFromExtension

  2. FileInspector.TextMetrics.cs:8 - The stop words list could be extracted to a configuration or external file for easier maintenance

  3. Settings.cs:326 - TopTokensEnabled defaults to false. Consider documenting why this feature is opt-in.


🎯 Summary & Recommendation

This PR introduces valuable metadata and text analysis features with generally good code quality. The main concerns are:

  1. Test Coverage - Needs significant expansion to cover new APIs
  2. Edge Case Handling - BOM detection and token extraction need hardening
  3. Performance - Minor optimizations possible in hot paths

Overall Assessment: ✅ Approve with suggestions

The code is production-ready with the understanding that:

  • The BOM handling edge case is rare but should be fixed
  • Test coverage should be expanded in follow-up commits
  • Performance optimizations are nice-to-have, not blockers

Great work on maintaining code quality and documentation standards throughout this large addition!


Review completed with focus on C# best practices, security, performance, and maintainability.


Copilot AI left a comment


Pull request overview

This PR introduces several enhancements to the FileInspectorX library focused on improving file analysis capabilities, script detection, and metadata extraction. The changes add token extraction for scripts/logs, comprehensive file system metadata support, enhanced declared vs detected type comparison with security implications, and improved script language detection.

Key changes include:

  • New TopTokens feature for extracting frequent tokens from script/log content (opt-in via settings)
  • File system metadata capture (FileSystemMetadata, FileInspectionSummary, SignatureStatus classes)
  • Enhanced declared type comparison with strong alternative candidates and danger detection (DeclaredTypeComparison)
  • Improved script language detection from file extensions with expanded coverage
  • Better UTF BOM handling in text reading functions
  • Enhanced PowerShell cmdlet extraction with module/function hints

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 12 comments.

Summary per file:

  • Module/FileInspectorX.psd1: Adds RequireLicenseAcceptance field and reformats PSData section
  • FileInspectorX/Views/ReportView.cs: Adds TopTokens property to report view with heuristics display logic
  • FileInspectorX/Views/AnalysisView.cs: Adds TopTokens property and updates EstimatedLineCount documentation
  • FileInspectorX/NuGet.README.md: Documents new features: InspectWithMetadata, TopTokens settings, and CompareDeclaredDetailed
  • FileInspectorX/Logging/Settings.cs: Adds TopTokens configuration settings (enabled, max, min length, min count)
  • FileInspectorX/Helpers.cs: Expands IsText helper to include script file extensions
  • FileInspectorX/Detection/FileInspector.DeclaredComparison.cs: New file implementing detailed declared vs detected type comparison with dangerous alternative detection
  • FileInspectorX/Detection/DeclaredTypeComparison.cs: New class modeling detailed comparison results with strong alternatives and danger flags
  • FileInspectorX/Analysis/SignatureStatus.cs: New lightweight signature summary class for simplified signature information access
  • FileInspectorX/Analysis/SecurityHeuristics.cs: Enhances cmdlet extraction, adds script hint parsing, improves UTF BOM handling
  • FileInspectorX/Analysis/SecurityHeuristics.Cmdlets.cs: New file with PowerShell-specific module/function/class extraction logic
  • FileInspectorX/Analysis/FileSystemMetadata.cs: New classes for file system metadata and options
  • FileInspectorX/Analysis/FileInspector.TextMetrics.cs: New file implementing top token extraction with stop words and frequency counting
  • FileInspectorX/Analysis/FileInspector.Metadata.cs: New file with metadata reading, signature status, and filename mismatch detection
  • FileInspectorX/Analysis/FileInspector.Analyze.cs: Improves script detection from extensions, adds TextMetrics population, enhances UTF BOM handling, maps "node" to "javascript"
  • FileInspectorX/Analysis/FileInspectionSummary.cs: New bundled class combining analysis, report, metadata, and flattened dictionary
  • FileInspectorX/Analysis/FileAnalysis.cs: Adds TopTokens property with documentation
  • FileInspectorX.Tests/FileInspectorXHeuristicsTests.cs: Adds test for ScriptLanguage and ScriptCmdlets population


return token.Substring(start, end - start + 1);
}

private static bool IsTrimChar(char c) => c == '-' || c == '_' || c == '.' || c == ':';

Copilot AI Dec 26, 2025


The IsTrimChar function includes colon (':') as a trim character, but this will cause issues with the token format which uses colon as a delimiter. When a token is formatted as "token:count" (line 88), if the original token ends with a colon during extraction, it will be trimmed. However, tokens should not contain colons in the first place based on IsTokenChar (line 95) which doesn't include colon. This inconsistency suggests either IsTokenChar should include colon, or IsTrimChar should not. Given the format "token:count", IsTrimChar should not include colon.

Suggested change
private static bool IsTrimChar(char c) => c == '-' || c == '_' || c == '.' || c == ':';
private static bool IsTrimChar(char c) => c == '-' || c == '_' || c == '.';

Comment on lines 218 to +232
foreach (var p in probes)
{
if (text.IndexOf(p, StringComparison.OrdinalIgnoreCase) >= 0) ordered.Add(p);
if (text.IndexOf(p, StringComparison.OrdinalIgnoreCase) >= 0) AddCmdlet(p);
}

Copilot AI Dec 26, 2025


This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Comment on lines +97 to +106
foreach (var candidate in candidates)
{
if (!IsStrongCandidate(candidate)) continue;
var ext = NormalizeExtension(candidate.Extension);
if (!string.IsNullOrEmpty(primaryExt) &&
!string.IsNullOrEmpty(ext) &&
string.Equals(ext, primaryExt, StringComparison.OrdinalIgnoreCase))
continue;
list.Add(candidate);
}

Copilot AI Dec 26, 2025


This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Comment on lines +42 to +51
foreach (var alt in strong)
{
var ext = NormalizeExtension(alt.Extension);
if (string.IsNullOrEmpty(ext)) continue;
if (string.Equals(ext, decl, StringComparison.OrdinalIgnoreCase))
{
cmp.DeclaredMatchesAlternative = true;
break;
}
}

Copilot AI Dec 26, 2025


This foreach loop immediately maps its iteration variable to another variable - consider mapping the sequence explicitly using '.Select(...)'.

Comment on lines +117 to +124
foreach (var candidate in candidates)
{
var ext = NormalizeExtension(candidate.Extension);
if (ext == null || ext.Length == 0) continue;
if (!isDangerous(ext)) continue;
if (seen.Add(ext)) list.Add(ext);
if (list.Count >= 3) break;
}

Copilot AI Dec 26, 2025


This foreach loop immediately maps its iteration variable to another variable - consider mapping the sequence explicitly using '.Select(...)'.

private static string? NormalizeModuleName(string? raw)
{
if (string.IsNullOrWhiteSpace(raw)) return null;
var name = (raw ?? string.Empty).Trim().Trim('"', '\'');

Copilot AI Dec 26, 2025


Condition is always not null because of the preceding call to IsNullOrWhiteSpace.

}
// JS minified heuristic if file extension is .js
var declaredExt = System.IO.Path.GetExtension(path)?.TrimStart('.').ToLowerInvariant();
var detectedExt = (det?.Extension ?? string.Empty).Trim().TrimStart('.').ToLowerInvariant();

Copilot AI Dec 26, 2025


Condition is always not null because of ... is ....

Suggested change
var detectedExt = (det?.Extension ?? string.Empty).Trim().TrimStart('.').ToLowerInvariant();
var detectedExt = det?.Extension?.Trim().TrimStart('.').ToLowerInvariant() ?? string.Empty;


// PDF heuristics
if (det.Extension == "pdf") {
if (det != null && det.Extension == "pdf") {

Copilot AI Dec 26, 2025


Condition is always true because of ... is ....

Suggested change
if (det != null && det.Extension == "pdf") {
if (det.Extension == "pdf") {


// OLE2 Office macros (VBA) check for legacy formats (.doc/.xls/.ppt)
if (det.Extension is "doc" or "xls" or "ppt")
if (det != null && det.Extension is "doc" or "xls" or "ppt")

Copilot AI Dec 26, 2025


Condition is always true because of ... is ....

Suggested change
if (det != null && det.Extension is "doc" or "xls" or "ppt")
if (det != null && (det.Extension == "doc" || det.Extension == "xls" || det.Extension == "ppt"))

* Introduced `MapTextSubtypeFromExtension` method for improved mapping of file extensions to text subtypes.
* Refactored token extraction logic to support a maximum number of unique tokens.
* Added settings for `TopTokensMaxUniqueTokens` and `ScriptHintMaxLineLength` to enhance configurability.
* Improved handling of script hints by enforcing line length constraints.
@claude

claude bot commented Dec 26, 2025

Pull Request Review: Move tier bridge

Summary

This PR introduces significant enhancements to FileInspectorX, adding file system metadata reading capabilities, improved PowerShell script detection, text subtype mapping enhancements, and top token extraction for script/log content analysis.

Overall Assessment: Strong implementation with minor concerns


Code Quality & Best Practices

Strengths ⭐

  1. Excellent code organization: New functionality is cleanly separated into partial classes (FileInspector.Metadata.cs, FileInspector.TextMetrics.cs, FileInspector.TextSubtype.cs)
  2. Good use of modern C# features: ReadOnlySpan<char> for efficient string processing, nullable reference types, readonly collections
  3. Comprehensive XML documentation: All public APIs are well-documented
  4. Defensive programming: Extensive try-catch blocks and null checks throughout
  5. Configurable behavior: New features controlled via Settings class with sensible defaults

Suggestions for Improvement 📝

  1. FileInspector.Analyze.cs:2036-2081 - BOM handling in ReadHeadText:

    • The BOM detection logic is excellent and handles UTF-8, UTF-16, UTF-32 variants
    • However, the original code path (without BOM) doesn't verify the text is valid UTF-8
    • Consider adding UTF-8 validation or using Encoding.UTF8.GetString(buf, 0, n) with a fallback to Latin-1/ASCII if decoding fails (a sketch of this fallback appears after this list)
  2. SecurityHeuristics.Cmdlets.cs:110-121 - IndexOfToken method:

    • The boundary check logic is solid, but could benefit from a comment explaining the word-boundary requirement
    • Consider extracting magic number 400 (line 19) to a constant or Settings property
  3. FileInspector.TextMetrics.cs:65 - Token limiting behavior:

    if (maxUnique > 0 && counts.Count >= maxUnique && !counts.ContainsKey(token)) continue;
    • This silently drops new tokens once the limit is reached
    • Consider documenting this behavior in the XML comments for TopTokensMaxUniqueTokens setting
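
A hedged sketch of the fallback decoding mentioned in item 1 above: decode strictly as UTF-8 and fall back to Latin-1 when the bytes are not valid UTF-8 (System.Text.Encoding.Latin1 requires .NET 5+; older targets could use Encoding.GetEncoding(28591)):

// Sketch only: strict UTF-8 decode with a lossless Latin-1 fallback.
static string DecodeHead(byte[] buf, int n)
{
    var strictUtf8 = new System.Text.UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        return strictUtf8.GetString(buf, 0, n);
    }
    catch (System.Text.DecoderFallbackException)
    {
        return System.Text.Encoding.Latin1.GetString(buf, 0, n);   // maps each byte 1:1 to a char
    }
}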

Potential Bugs & Issues

Critical Issues 🔴

None identified.

Medium Priority Issues 🟡

  1. FileInspector.Analyze.cs:510-516 - Script language mapping logic:

    var detectedExt = (det?.Extension ?? string.Empty).Trim().TrimStart('.').ToLowerInvariant();
    var mappedScript = MapScriptLanguageFromExtension(detectedExt) ?? MapScriptLanguageFromExtension(declaredExt);
    • If det is null, this will map empty string twice, which is harmless but inefficient
    • Consider: if (det == null) continue; early exit or null-coalesce the extensions first
  2. FileInspector.TextMetrics.cs:37 - Magic number hardcoded:

    int cap = Math.Min(Settings.DetectionReadBudgetBytes, 256 * 1024);
    • The 256KB cap is hardcoded but should arguably be a Setting
    • This could be surprising if someone increases DetectionReadBudgetBytes expecting top token extraction to use more data
  3. SecurityHeuristics.Cmdlets.cs:77 - Potential StringComparison bug:

    int nameIdx = lower.IndexOf("-name", StringComparison.Ordinal);
    if (nameIdx < 0) nameIdx = lower.IndexOf("-modulename", StringComparison.Ordinal);
    • The variable lower is already lowercased, so StringComparison.Ordinal is correct
    • However, the check on line 77 for -modulename prefix uses StartsWith on the original case-sensitive span
    • Recommendation: Use OrdinalIgnoreCase for consistency or convert span slice to lowercase

Low Priority Issues 🟢

  1. FileInspector.Metadata.cs:148-151 - Signature metadata overwrite:
    • The AddSignatureMetadata method could potentially overwrite keys from the report dictionary
    • Consider checking for key existence or documenting the precedence order

Performance Considerations

Positive Aspects ⚡

  1. Efficient string processing: Excellent use of Span<char> and ReadOnlySpan<char> to avoid allocations
  2. Bounded operations: All file reads are capped by configurable budget limits
  3. Early exits: Good use of early returns when features are disabled
  4. Dictionary lookups: Appropriate use of StringComparer.OrdinalIgnoreCase for case-insensitive operations

Potential Concerns ⚠️

  1. FileInspector.TextMetrics.cs:47-68 - Token extraction dictionary growth:

    • The maxUnique limit prevents unbounded growth, which is good
    • However, for large files (256KB), this could still create significant GC pressure
    • Consider: Pre-sizing the dictionary with initial capacity when maxUnique > 0
    var counts = maxUnique > 0 
        ? new Dictionary<string, int>(maxUnique, StringComparer.OrdinalIgnoreCase)
        : new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
  2. FileInspector.Analyze.cs:628-633 - Redundant string comparisons:

    • The PowerShell detection uses two separate string comparisons
    • Consider combining: string.Equals(res.ScriptLanguage ?? res.TextSubtype, "powershell", ...)

Security Concerns

Overall Security Posture: Strong 🔒

  1. No injection vulnerabilities identified: File path handling appears safe
  2. Resource limits properly enforced: All read operations bounded by settings
  3. No sensitive data exposure: Metadata collection respects opt-in flags

Minor Observations

  1. FileSystemMetadata.cs:23 - MagicHeaderHex exposure:

    • Exposing raw file headers could theoretically leak sensitive data if users enable this on confidential files
    • Current design (opt-in via IncludeMagicHeader flag) is appropriate
    • Consider: Adding a note in documentation about potential sensitivity
  2. Settings.cs:326 - TopTokensEnabled default is false:

    • Good security posture to have this opt-in
    • Top tokens could theoretically expose sensitive keywords from log files

Test Coverage

Current Coverage: Adequate ✅

New test added:

  • PsScript_Populates_ScriptLanguage_And_Cmdlets - Tests PowerShell cmdlet extraction

Recommendations for Additional Tests 🧪

  1. Metadata functionality:

    • Test ReadFileMetadata with non-existent file
    • Test InspectWithMetadata with various FileMetadataOptions combinations
    • Test IsOriginalFilenameMismatch with edge cases (null, empty, path vs filename)
  2. Text subtype mapping:

    • Test fallback behavior when declared and detected extensions differ
    • Test script language backfilling from TextSubtype
    • Test with files having no extension
  3. Top token extraction:

    • Test with TopTokensEnabled = true
    • Test boundary conditions (empty file, file with only stop words, file with only numbers)
    • Test maxUnique limiting behavior
  4. Edge cases:

    • BOM handling in ReadHeadText for various encodings
    • Very long lines exceeding ScriptHintMaxLineLength
    • Files exactly at buffer boundaries

Suggested test additions:

[Xunit.Fact]
public void ReadFileMetadata_NonExistentFile_ReturnsExistsFalse()
{
    var meta = FileInspector.ReadFileMetadata("/nonexistent/path.txt");
    Xunit.Assert.NotNull(meta);
    Xunit.Assert.False(meta!.Exists);
}

[Xunit.Fact]
public void ExtractTopTokens_WithMaxUniqueLimit_RespectsLimit()
{
    // Test that when maxUniqueTokens is reached, new tokens are ignored
}

[Xunit.Fact] 
public void MapTextSubtypeFromExtension_HandlesVariousExtensions()
{
    // Test .ps1, .py, .sh, .json, .yaml, etc.
}

Additional Observations

Documentation 📚

  • Module manifest update (FileInspectorX.psd1): Version metadata updated appropriately
  • NuGet README: Release notes section added, which is helpful for users

Commit Messages 📝

The commit messages follow conventional commit format with emojis, which is good:

  • ✨ feat(tests): add PowerShell script detection tests
  • ✨ feat(metadata): Add file system metadata reading
  • ✨ feat(analysis): Enhance text subtype mapping

Breaking Changes

No breaking changes identified - all new functionality is additive.


Recommendations Summary

Must Address Before Merge

None - code is production-ready as-is.

Should Consider

  1. Add UTF-8 validation fallback in ReadHeadText
  2. Extract magic numbers to Settings (256KB cap, 400 line limit)
  3. Pre-size dictionaries when maxUnique is known
  4. Add recommended test cases for better coverage

Nice to Have

  1. Add XML doc comments explaining token dropping behavior with maxUnique
  2. Document signature metadata key precedence in CollectMetadata
  3. Add inline comments for word-boundary logic in IndexOfToken

Conclusion

This is a high-quality PR that adds valuable functionality to FileInspectorX. The code demonstrates strong engineering practices with proper separation of concerns, defensive programming, and performance optimization. The additions are well-integrated with the existing codebase and maintain consistency with established patterns.

Recommendation: APPROVE with minor suggestions

The identified issues are all minor and don't block merging. Consider addressing the test coverage gaps and performance optimizations in a follow-up PR if desired.

Great work! 🎉

…n scanning limit

- Introduced `TopTokensMaxBytes` to allow users to set a maximum byte limit for scanning when computing top tokens.
- Default value set to 256 KB, with an option to fall back to `DetectionReadBudgetBytes` if set to 0.
- Enhanced script language detection logic for improved accuracy.
@claude

claude bot commented Dec 27, 2025

Code Review for PR #12: Move tier bridge

I have completed a comprehensive review of this pull request. Overall, this is a well-structured enhancement that adds significant new functionality while maintaining code quality.

=== STRENGTHS ===

  1. Well-organized code structure with new partial class files following existing patterns
  2. Good separation of concerns for metadata extraction, text metrics, and type detection
  3. Defensive programming with null checks, try-catch blocks, and boundary validations
  4. Comprehensive XML documentation consistent with the codebase
  5. New test added for PowerShell cmdlet detection

=== POTENTIAL ISSUES ===

  1. FileInspector.TextMetrics.cs:91-92 - Token character validation includes - and . which could lead to IP addresses and file names being treated as single tokens. The TryTrimToken method suggests these might not always be desirable as part of tokens.

  2. FileInspector.Metadata.cs:140-151 - Potential key collision when merging report dictionary. Direct assignment could silently overwrite existing keys. AddSignatureMetadata checks with ContainsKey, but report merge does not.

  3. FileInspector.Analyze.cs:779 - The comment is outdated and should be updated to reflect that TryPopulateTextMetrics handles more than just CSV/TSV.

=== PERFORMANCE ===

  • Excellent use of ReadOnlySpan for token extraction
  • BOM detection ordered from most common to least common - optimal
  • TopTokensEnabled defaults to false - good for opt-in performance impact
  • Dictionary capacity limiting prevents memory exhaustion

=== SECURITY ===

  • Proper path handling with System.IO.Path methods
  • Good resource limits prevent DoS via exhaustion
  • Cmdlet extraction properly restricted to PowerShell only
  • No injection vulnerabilities observed

=== TEST COVERAGE RECOMMENDATIONS ===

  1. Token extraction edge cases (empty files, stop words only, boundary conditions)
  2. Metadata collection with missing files and key conflicts
  3. Text subtype mapping fallback logic
  4. BOM handling with various encodings

=== RECOMMENDATION ===

APPROVE with minor suggestions

The issues identified are relatively minor and do not block merging. Main improvements:

  • Clarify token character definition
  • Add tests for new functionality
  • Document metadata key override behavior
  • Update outdated comment

The code quality is high with good defensive programming practices and performance considerations.


Review generated with Claude Code

…settings

* Updated XML comments to indicate potential exposure of sensitive content for `MagicHeaderHex` and `IncludeMagicHeader`.
* Introduced `ScriptHintMaxLines` setting to control the maximum number of lines scanned for script hints, with a default of 400.
* Added note in `TopTokensEnabled` to highlight the risk of surfacing sensitive keywords.
* Updated version numbers in project files to reflect the new release.
* Ensured consistency across `Update-Version.ps1`, `.csproj`, and module manifest files.
@claude

claude bot commented Dec 27, 2025

Pull Request Review: Move tier bridge (PR #12)

Summary

This PR introduces several significant enhancements to FileInspectorX, including metadata extraction, text metrics with token analysis, improved script detection, and enhanced declared vs detected type comparison. The changes span 19 files with 1027 additions and 40 deletions.


Code Quality and Best Practices

Strengths ✅

  1. Well-structured code organization: The new partial class files (FileInspector.Metadata.cs, FileInspector.TextMetrics.cs, FileInspector.TextSubtype.cs) follow good separation of concerns.

  2. Comprehensive XML documentation: All public APIs have clear documentation explaining parameters, return values, and behavior.

  3. Proper use of nullable reference types: Consistent use of ? for nullable types throughout the codebase.

  4. Performance-conscious: Use of ReadOnlySpan<char> in text processing (lines 55-115 in FileInspector.TextMetrics.cs) shows good performance awareness.

  5. Defensive programming: Extensive null checks and try-catch blocks prevent exceptions from propagating.

Areas for Improvement 🔍

  1. Silent exception handling (multiple locations):

    catch { }  // Lines 46, 60, 86, 109, 166 in SecurityHeuristics.Cmdlets.cs

    Issue: Swallowing exceptions makes debugging difficult.
    Recommendation: At minimum, log exceptions when verbose logging is enabled:

    catch (Exception ex) 
    { 
        if (Settings.Logger.IsDebug) 
            Settings.Logger.LogDebug($"Token extraction failed: {ex.Message}"); 
    }
  2. BOM handling enhancement (FileInspector.Analyze.cs:2040-2049):
    The new BOM detection in ReadHeadText is excellent, but consider extracting it to a helper method for reusability:

    private static (int bomLength, Encoding encoding) DetectBOM(byte[] buf, int length) { ... }
  3. Magic number literals (FileInspector.DeclaredComparison.cs:7):

    private const int StrongCandidateScoreThreshold = 80;

    Good: This is properly defined as a constant.
    Suggestion: Consider making it configurable via Settings if users need to adjust sensitivity.

  4. Duplicate logic reduction (FileInspector.Analyze.cs:505-540):
    The script language mapping appears in multiple places. The new MapScriptLanguageFromExtension helper is great, but ensure it's used consistently everywhere.


Potential Bugs or Issues

Critical 🔴

None identified. The code appears solid.

Medium Priority 🟡

  1. Potential null reference (FileInspector.Analyze.cs:631-637):

    var psLang = res.ScriptLanguage ?? res.TextSubtype;
    if (string.Equals(psLang, "powershell", StringComparison.OrdinalIgnoreCase))
    {
        var cmdlets = SecurityHeuristics.GetCmdlets(path, Settings.DetectionReadBudgetBytes);

    Issue: If both ScriptLanguage and TextSubtype are null, psLang will be null but the comparison will still work (returns false). However, for clarity, consider:

    if (!string.IsNullOrEmpty(res.ScriptLanguage ?? res.TextSubtype) && 
        string.Equals(psLang, "powershell", StringComparison.OrdinalIgnoreCase))
  2. Dictionary key collision handling (FileInspector.Metadata.cs:128-129):

    foreach (var kv in reportDict) dict[kv.Key] = kv.Value;

    Issue: Silently overwrites existing keys. While this may be intentional, consider documenting the precedence order.

  3. Token extraction memory bounds (FileInspector.TextMetrics.cs:52-54):

    var counts = maxUnique > 0
        ? new Dictionary<string, int>(maxUnique, StringComparer.OrdinalIgnoreCase)
        : new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

    Observation: When maxUnique is 0, the dictionary is unbounded. The check on line 69 protects against unbounded growth, but the initial capacity hint could be set to a reasonable default (e.g., 256) for better memory allocation.


Performance Considerations

Positive Aspects ✅

  1. Span-based string processing: Excellent use of ReadOnlySpan<char> in ExtractTopTokens (lines 55-72) avoids string allocations.

  2. Early returns: Good use of guard clauses to avoid unnecessary work (e.g., lines 22-33 in TryPopulateTextMetrics).

  3. Bounded operations: All loops and reads have configurable limits to prevent runaway resource consumption.

  4. Lazy evaluation: Metadata collection only happens when requested via InspectWithMetadata.

Potential Optimizations 💡

  1. String.Contains performance (FileInspector.TextMetrics.cs:68):

    if (TopTokenStopWords.Contains(token)) continue;

    Current: HashSet lookup - already optimal. ✅

  2. List sorting (FileInspector.TextMetrics.cs:75-80):
    Consider using a min-heap (PriorityQueue in .NET 6+) if TopTokensMax is small relative to the unique token count; the current approach is fine for typical use cases (see the sketch after this list).

  3. File I/O optimization (FileInspector.Metadata.cs:32):

    meta.MagicHeaderHex = MagicHeaderHex(path, options.MagicHeaderBytes);

    Recommendation: If InspectWithMetadata is called after Inspect, the file is read twice. Consider passing the already-read header data if available.
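
A sketch of the top-k idea from item 2 above, using the built-in PriorityQueue as a size-bounded min-heap (element and return types are illustrative):

// Sketch only: keep only the `max` most frequent tokens instead of sorting the full list.
static List<KeyValuePair<string, int>> TopK(Dictionary<string, int> counts, int max)
{
    var heap = new PriorityQueue<KeyValuePair<string, int>, int>();
    foreach (var kv in counts)
    {
        heap.Enqueue(kv, kv.Value);                 // priority = count, so the smallest count sits on top
        if (heap.Count > max) heap.Dequeue();       // evict the current minimum once over capacity
    }
    var result = new List<KeyValuePair<string, int>>(heap.Count);
    while (heap.Count > 0) result.Add(heap.Dequeue());
    result.Reverse();                               // highest counts first
    return result;
}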


Security Concerns

Good Security Practices ✅

  1. Input validation: Consistent checks for null/empty strings before processing.

  2. Bounded reads: All file reads are limited by DetectionReadBudgetBytes and other configurable limits.

  3. No code execution: All analysis is static; no dynamic code evaluation.

  4. Safe encoding detection: BOM handling prevents encoding attacks.

Security Notes 🔐

  1. TopTokens feature (Settings.cs:322-326):

    /// Note: enabling this can surface sensitive keywords from scripts/logs.
    public static bool TopTokensEnabled { get; set; } = false;

    Good: Default is false and documentation warns about sensitive data exposure. ✅

  2. MagicHeaderHex exposure (FileSystemMetadata.cs:23):

    /// Magic header bytes as hex, when requested. Note: may expose sensitive content.
    public string? MagicHeaderHex { get; set; }

    Good: Documentation warns about potential sensitive data. ✅

  3. Stop word list (FileInspector.TextMetrics.cs:8-16):
    The stop word list is comprehensive and prevents common noise. Consider adding common credential-related terms to avoid leaking sensitive tokens:

    "password", "secret", "apikey", "token", "credential"

Test Coverage

Existing Tests ✅

  1. New test added (FileInspectorXHeuristicsTests.cs:33-50):
    [Xunit.Fact]
    public void PsScript_Populates_ScriptLanguage_And_Cmdlets()
    Good: Tests the new cmdlet extraction feature with specific assertions.

Test Coverage Gaps 🧪

  1. Missing tests for new metadata APIs:

    • ReadFileMetadata with various options
    • InspectWithMetadata integration
    • CollectMetadata dictionary merging behavior
    • GetSignatureStatus edge cases
    • IsOriginalFilenameMismatch with various inputs
  2. Missing tests for TopTokens feature:

    • Token extraction with various text types
    • Stop word filtering
    • Token counting and sorting
    • Edge cases (empty file, all stop words, unicode)
  3. Missing tests for new script language mapping:

    • MapScriptLanguageFromExtension for all supported languages
    • IsScriptTextSubtype for boundary cases
  4. Missing tests for DeclaredTypeComparison enhancements:

    • Strong alternatives detection
    • Dangerous alternative handling
    • Plain text family bypass detection

Recommendation: Add comprehensive unit tests for all new public APIs before merging.


Specific File-by-File Comments

FileInspector.TextMetrics.cs

  • Line 114: IsTrimChar could include more edge cases like ',' if tokens can be comma-separated.
  • Lines 8-16: Consider making TopTokenStopWords configurable via Settings for advanced users.

FileInspector.Metadata.cs

  • Line 17: When file doesn't exist, consider whether returning an object with Exists = false is better than returning null. Current approach is fine, but document the contract clearly.
  • Lines 150-154: The !dict.ContainsKey checks are good defensive programming.

SecurityHeuristics.Cmdlets.cs

  • Line 156: Consider using Path.DirectorySeparatorChar instead of hardcoding '\\' and '/' for better cross-platform support.
  • Line 118-121: The word boundary enforcement in IndexOfToken is excellent for avoiding false positives.

FileInspector.DeclaredComparison.cs

  • Lines 78-82: The plain text bypass detection for dangerous alternatives is a clever security enhancement.
  • Line 75: The AppendReason helper is called but not shown - ensure it handles null/empty strings correctly.

Settings.cs

  • Lines 322-348: The new TopTokens settings are well-documented with sensible defaults.
  • Line 327: Default of false for TopTokensEnabled is correct given security implications.

Recommendations

Before Merging

  1. ✅ Add comprehensive unit tests for new APIs
  2. ✅ Improve exception handling (log instead of silent catch)
  3. ✅ Add credential-related terms to stop word list
  4. ✅ Document dictionary key precedence in CollectMetadata

Future Enhancements

  1. Consider extracting BOM detection to reusable helper
  2. Make StrongCandidateScoreThreshold configurable
  3. Consider API to pass pre-read file data to avoid duplicate I/O
  4. Add integration tests with real-world files

Conclusion

This is a high-quality PR that significantly enhances FileInspectorX's capabilities. The code follows good practices, is well-documented, and introduces useful features. The main areas for improvement are:

  1. Test coverage for new features
  2. Exception handling (logging vs silent suppression)
  3. Minor documentation clarifications

Overall Assessment: Approve with minor suggestions

The code is production-ready after addressing test coverage gaps. Excellent work on the metadata extraction and text analysis features!


Review generated with Claude Code

* Corrected the formatting of the module manifest.
* Ensured proper syntax for PowerShell module compatibility.
* Introduced new properties in `FileAnalysis` for caching detection results:
  - `DetectedExtension`
  - `DetectedMimeType`
  - `DetectionConfidence`
  - `DetectionReason`
  - `DetectionValidationStatus`
  - `DetectionScore`
  - `DetectionIsDangerous`
* Implemented `PopulateDetectionSummary` method to populate these properties based on detection results.
* Enhanced PE kind detection logic in `FileInspector`.
@claude

claude bot commented Dec 27, 2025

Pull Request Review: v1.0.6 Feature Enhancements

Summary

This PR introduces significant enhancements to FileInspectorX v1.0.6, including file metadata reading capabilities, improved script detection, top token extraction, and various detection improvements. The changes span 24 files with 1,078 additions and 47 deletions.

Code Quality & Best Practices ✅

Strengths

  1. Good modularization: New functionality properly separated into partial classes (FileInspector.Metadata.cs, FileInspector.TextMetrics.cs, FileInspector.TextSubtype.cs, etc.)
  2. Comprehensive XML documentation: All public APIs are well-documented with clear descriptions
  3. Appropriate use of nullable reference types: Consistent use of ? annotations throughout
  4. Clean separation of concerns: Detection summary, metadata, and text metrics are logically separated
  5. Thread-safe considerations: Good use of StringComparer.OrdinalIgnoreCase for case-insensitive dictionaries

Areas for Improvement

1. BOM Handling in ReadHeadText (FileInspector.Analyze.cs:2084-2094)

The new BOM detection logic is excellent, but consider extracting it into a separate helper method for reusability:

private static string ReadHeadTextWithBomDetection(byte[] buffer, int bytesRead)

2. Magic Number Constants (FileInspector.Analyze.cs:898-900)

The PE characteristics constant should be defined at the class level:

private const ushort IMAGE_FILE_DLL = 0x2000;

Currently it is defined inline, which reduces discoverability.

3. Error Handling Granularity

Several methods use empty catch blocks (catch { }). While acceptable for optional enrichment features, consider logging when Settings.Debug is enabled for troubleshooting:

  • FileInspector.TextMetrics.cs:46
  • FileInspector.Metadata.cs:35-38
  • SecurityHeuristics.Cmdlets.cs:60

Potential Bugs & Issues ⚠️

Critical Issues

1. Dictionary Key Collision in CollectMetadata (FileInspector.Metadata.cs:128-131)

var reportDict = (report ?? ReportView.From(analysis)).ToDictionary();
foreach (var kv in reportDict) dict[kv.Key] = kv.Value;

This silently overwrites file metadata if there are duplicate keys. Consider using TryAdd with a warning or prefixing keys to avoid collisions.

2. Truncated Diff in PR

The diff appears truncated at line 82 in FileInspector.TextMetrics.cs:

var result = new List<string>(Math.Min(max, list.C

This suggests the file may be incomplete. Please verify the ExtractTopTokens method completes properly.

Medium Priority

3. Potential Null Reference (FileInspector.Analyze.cs:778)

if (det != null && det.Extension == "pdf") {

Later in the same method (line 809), det is used without null check:

if (det != null && det.Extension is "doc" or "xls" or "ppt")

While this is safe given the control flow, the earlier null-check at line 743 (if (det == null) det = ...) could be clearer about invariants.

4. Unbounded Token Dictionary Growth (FileInspector.TextMetrics.cs:52-71)

var counts = maxUnique > 0
    ? new Dictionary<string, int>(maxUnique, StringComparer.OrdinalIgnoreCase)
    : new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

When maxUnique is 0, the dictionary can grow without bounds for large files. While bounded by TopTokensMaxBytes, consider enforcing a reasonable default (e.g., 100K) to prevent memory issues.

Performance Considerations 🚀

Positive Aspects

  1. Configurable budgets: Good use of TopTokensMaxBytes, ScriptHintMaxLines, and ScriptHintMaxLineLength to bound operations
  2. Span usage: Efficient text processing with ReadOnlySpan<char> in token extraction (FileInspector.TextMetrics.cs:55)
  3. Early exits: Proper guard clauses prevent unnecessary processing
  4. BOM skip: Proper handling to avoid processing BOM bytes as content

Recommendations

1. Token Extraction Optimization

Consider using stackalloc for small token buffers to reduce allocations:

Span<char> tokenBuffer = token.Length <= 256 
    ? stackalloc char[token.Length] 
    : new char[token.Length];

2. String Allocation in Hot Paths (FileInspector.TextMetrics.cs:67)

var token = trimmed.ToString().ToLowerInvariant();

This allocates two strings per token. Consider using MemoryExtensions.Equals with case-insensitive comparison for stop word checking before allocating.
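
A hedged sketch of that ordering: compare the span against the stop words before materializing any string; storing stop words as a string array here is an assumption made for the span comparison:

// Sketch only: reject stop words and short tokens on the span, then allocate at most once.
static string? AcceptToken(ReadOnlySpan<char> trimmed, string[] stopWords, int minLen)
{
    if (trimmed.Length < minLen) return null;
    foreach (var stop in stopWords)
    {
        if (trimmed.Equals(stop.AsSpan(), StringComparison.OrdinalIgnoreCase))
            return null;                             // stop word rejected with no allocation
    }
    return trimmed.ToString();                       // single allocation; a case-insensitive
                                                     // dictionary comparer can handle casing
}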

3. Multiple Extension Normalizations

Extension normalization happens in multiple places. Consider a single NormalizeExtension helper to ensure consistency and reduce allocations.

Security Concerns 🔒

Positive Aspects

  1. Excellent security awareness: Documentation clearly warns about sensitive content exposure (Settings.cs:325, FileSystemMetadata.cs:22)
  2. Bounded operations: All file reads are capped with budget limits
  3. Stop word filtering: Prevents some sensitive data from appearing in top tokens
  4. Default-off for risky features: TopTokensEnabled defaults to false

Concerns

1. TopTokens Privacy Risk (Settings.cs:323-327)

While documented, this feature can expose:

  • Passwords/secrets in scripts
  • User/system names
  • Internal hostnames
  • API keys

Recommendation: Consider adding a TopTokensRedactPatterns setting to automatically filter known sensitive patterns (e.g., password=, apikey=, email patterns).
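
A minimal sketch of such a filter, treating the proposed TopTokensRedactPatterns values as case-insensitive substrings (later commits in this PR introduce a setting by that name, but the exact semantics may differ):

// Sketch only: drop any token containing a configured sensitive pattern.
static bool IsRedacted(string token, IEnumerable<string> redactPatterns)
{
    foreach (var pattern in redactPatterns)
    {
        if (!string.IsNullOrEmpty(pattern) &&
            token.IndexOf(pattern, StringComparison.OrdinalIgnoreCase) >= 0)
            return true;    // e.g. "password", "apikey", "secret"
    }
    return false;
}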

2. MagicHeaderHex Exposure (FileSystemMetadata.cs:22-23)

Binary file headers can contain sensitive data (e.g., embedded credentials in config files). Consider:

  • Reducing default from 64 to 16 bytes (sufficient for most magic numbers)
  • Adding a warning in documentation that this shouldn't be logged/transmitted in production

3. Path Disclosure (FileMetadataOptions.cs:32)

IncludePath defaults to true, which may leak internal file system structure. Consider defaulting to false or documenting this risk.

Test Coverage ✅

Strengths

  1. Good test coverage for new features: New test PsScript_Populates_ScriptLanguage_And_Cmdlets validates the core functionality
  2. Cross-platform test helpers: WriteTemp and TryDelete are well-designed
  3. Edge case testing: Existing tests cover ambiguous extensions (.config, .conf)

Gaps

1. Missing Tests for New Features

  • No tests for FileInspectionSummary / InspectWithMetadata
  • No tests for TopTokens extraction
  • No tests for FileSystemMetadata reading
  • No tests for SignatureStatus
  • No tests for new PeKind detection
  • No tests for DetectionSummary population

2. Missing Edge Case Tests

  • Empty files
  • Files with only BOM
  • Unicode tokens in ExtractTopTokens
  • Files exceeding TopTokensMaxBytes
  • Null/empty extension handling in new mapping methods

3. Missing Security Tests

  • Test that sensitive patterns are filtered from TopTokens
  • Test that ScriptHintMaxLines is enforced
  • Test metadata options privacy flags

Additional Observations

1. Settings Growth

Eight new settings are added (TopTokensEnabled, TopTokensMaxBytes, TopTokensMax, TopTokensMinLength, TopTokensMinCount, TopTokensMaxUniqueTokens, ScriptHintMaxLines, ScriptHintMaxLineLength). While each is justified, consider whether some could be grouped or whether the defaults cover 90% of use cases, to reduce configuration complexity.

2. Breaking Changes

The PR appears backward compatible - all new features are opt-in or additive. Good job!

3. Documentation Quality

XML documentation is excellent throughout. Consider adding example usage in the NuGet README for the new InspectWithMetadata API.

Recommendations Summary

Must Fix 🔴

  1. Verify and fix the truncated diff in FileInspector.TextMetrics.cs
  2. Address dictionary key collision issue in CollectMetadata

Should Fix 🟡

  1. Add unit tests for new public APIs
  2. Extract BOM detection into a helper method
  3. Add security tests for TopTokens
  4. Reduce default MagicHeaderBytes from 64 to 16
  5. Consider IncludePath default to false

Nice to Have 🟢

  1. Add debug logging to empty catch blocks
  2. Define PE constants at class level
  3. Optimize token extraction with stackalloc
  4. Add TopTokensRedactPatterns feature
  5. Add more edge case tests

Conclusion

This is a well-engineered PR that adds valuable functionality while maintaining code quality. The modularization, documentation, and security awareness are commendable. The main concerns are missing test coverage for new features and a potential dictionary collision bug. Once the critical issues are addressed and test coverage is improved, this will be ready to merge.

Overall Assessment: Approve with minor changes requested ✅

Great work on this release! 🎉

…routines

* Enhanced exception handling to log specific errors in `SecurityHeuristics.Cmdlets.cs`, `TextMetrics.cs`, and `Metadata.cs`.
* Moved `IMAGE_FILE_DLL` constant to a new file `PeConstants.cs` for better organization.
* Added a hard limit for unique tokens in `TextMetrics.cs` to prevent excessive memory usage.
* Introduced a new method `ReadHeadText` in `TextReading.cs` for improved text reading with BOM detection.
@claude

claude bot commented Dec 27, 2025

Pull Request Review: Move tier bridge (v1.0.6)

Summary

This PR adds significant functionality to FileInspectorX with new metadata collection APIs, text metrics extraction, and improved script detection. The changes introduce ~1,131 additions across 26 files with good test coverage.

Code Quality & Best Practices

Strengths

  • Well-structured partial classes: Excellent use of partial classes to organize the growing FileInspector class into logical units (Metadata, TextMetrics, TextReading, TextSubtype, DetectionSummary, DeclaredComparison).
  • Strong API design: New APIs like InspectWithMetadata() and ReadFileMetadata() follow clear naming conventions and provide good separation of concerns.
  • Comprehensive settings: New settings in Settings.cs are well-documented with XML comments and sensible defaults.
  • Test coverage: Good addition of test cases, particularly PsScript_Populates_ScriptLanguage_And_Cmdlets().

Areas for Improvement

1. Missing Null Check in FileInspector.TextMetrics.cs:85

list.Sort((a, b) =>
{
    int c = b.Value.CompareTo(a.Value);
    return c != 0 ? c : string.Compare(a.Key, b.Key, StringComparison.OrdinalIgnoreCase);
});

The diff appears truncated, but the string.Compare call should be verified for null safety; dictionary keys shouldn't be null, yet defensive coding is recommended here.

2. Performance Consideration: Token Extraction

FileInspector.TextMetrics.cs:55-96 - The ExtractTopTokens method processes text character-by-character using ReadOnlySpan<char>. While efficient, consider:

  • The hard limit of TopTokensHardMaxUniqueTokens = 100_000 could consume significant memory for large files
  • Current implementation drops new tokens after hitting the limit (line 75), which may miss important tokens appearing later in the file
  • Suggestion: Document this behavior clearly or consider a reservoir sampling approach for more representative token distribution

3. Exception Swallowing

Multiple locations use empty catch blocks or only log in debug mode:

  • FileInspector.Metadata.cs:35-40 - catches all exceptions, only logs in debug
  • FileInspector.TextReading.cs:19-22 - returns empty string on any exception
  • SecurityHeuristics.Cmdlets.cs:60-64 - generic exception handling

Recommendation: While acceptable for optional/best-effort operations, consider logging at least the exception type at warning level for production debugging.

4. Magic Number in FileInspector.DeclaredComparison.cs:7

private const int StrongCandidateScoreThreshold = 80;

This threshold is hardcoded but would benefit from being configurable in Settings.cs for consistency with other detection thresholds.

5. Incomplete Refactoring in FileInspector.Analyze.cs

Lines 508-523 show good refactoring with new helper methods MapScriptLanguageFromExtension() and MapTextSubtypeFromExtension(), but the logic is complex with multiple fallback paths:

var mappedScript = MapScriptLanguageFromExtension(detectedExt);
if (mappedScript == null && !string.IsNullOrWhiteSpace(declaredExt))
    mappedScript = MapScriptLanguageFromExtension(declaredExt);
// ... followed by more conditional backfilling

This could be simplified with a clearer priority order to reduce cognitive load.

Potential Bugs

1. Possible NullReferenceException in FileInspector.Analyze.cs:778

if (det != null && det.Extension == "pdf") {

The code checks det != null but earlier in the method (visible in full context), det is dereferenced without null checks. The diff shows defensive null checks added in some places but not consistently.

2. Dictionary Key Collision Handling

FileInspector.Metadata.cs:131-136:

foreach (var kv in reportDict)
{
    if (!dict.ContainsKey(kv.Key))
        dict[kv.Key] = kv.Value;
    else if (Settings.Logger.IsDebug)
        Settings.Logger.WriteDebug("metadata:collision key '{0}' skipped", kv.Key);
}

Silent collision resolution (first wins) could hide bugs. Consider:

  • Warning level logging instead of debug
  • Or throwing an exception in debug builds
  • Or documenting the collision resolution strategy in XML comments

3. File Extension Comparison Case Sensitivity

Multiple locations use ToLowerInvariant() for extension comparison, which is good. However, verify that all file system operations handle case-insensitive file systems correctly (Windows) vs case-sensitive (Linux).

Security Concerns

1. File Path Inclusion in Metadata

FileSystemMetadata.Path can expose sensitive file paths in logs/telemetry:

if (options.IncludePath) meta.Path = path;

Good: This is opt-in with FileMetadataOptions.IncludePath (default true).
Recommendation: Document the privacy implications in the XML comment for IncludePath setting.

2. Magic Header Hex Exposure

FileSystemMetadata.cs:43:

/// <summary>Magic header bytes as hex, when requested. Note: may expose sensitive content.</summary>
public string? MagicHeaderHex { get; set; }

Good: The XML comment warns about sensitive content exposure.
Concern: Default is IncludeMagicHeader = true with 64 bytes. This could leak sensitive data (encryption keys, credentials) from file headers.
Recommendation: Consider defaulting to false or reducing default MagicHeaderBytes to 16 (sufficient for most file type detection).

3. Top Tokens May Leak Sensitive Information

Settings.cs:326:

/// Note: enabling this can surface sensitive keywords from scripts/logs.
public static bool TopTokensEnabled { get; set; } = false;

Good: Disabled by default with clear warning.
Good: Stop words list excludes common keywords.
Concern: Token extraction doesn't filter sensitive patterns (passwords, API keys, tokens).
Recommendation: Add pattern-based filtering to exclude tokens matching password, apikey, token, secret, etc.

4. Command Injection Risk - Low

SecurityHeuristics.Cmdlets.cs extracts module names and function names from PowerShell scripts. The extraction appears safe (no execution), but verify downstream consumers don't execute or eval these strings.

Performance Considerations

1. Multiple File Reads

The analysis flow may read the same file multiple times:

  • Initial detection reads header
  • ReadHeadText() reads up to DetectionReadBudgetBytes (1MB default)
  • Token extraction may re-read
  • Cmdlet extraction may re-read

Recommendation: Consider caching the initial read buffer for reuse across heuristics.
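
One possible shape for that cache, sketched as a small lazily-evaluated head reader shared by the heuristics within a single analysis pass (class and method names are illustrative, not the library's API):

// Sketch only: read the bounded head once and hand the same text to every heuristic.
sealed class HeadTextCache
{
    private readonly string _path;
    private readonly int _budget;
    private string? _text;
    public HeadTextCache(string path, int budgetBytes) { _path = path; _budget = budgetBytes; }
    public string Text => _text ??= ReadHead(_path, _budget);     // only the first caller pays the I/O cost
    private static string ReadHead(string path, int budget)
    {
        using var fs = File.OpenRead(path);
        var buf = new byte[Math.Min(budget, (int)Math.Min(fs.Length, int.MaxValue))];
        int n = fs.Read(buf, 0, buf.Length);
        return System.Text.Encoding.UTF8.GetString(buf, 0, n);
    }
}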

2. String Allocations in Token Processing

FileInspector.TextMetrics.cs:73:

var token = trimmed.ToString().ToLowerInvariant();

Each token creates two string allocations. For large files with many tokens, this could pressure GC.
Suggestion: Use StringComparer.OrdinalIgnoreCase in the dictionary constructor to avoid ToLowerInvariant() on every token.

3. BOM Detection on Every Read

FileInspector.TextReading.cs:25-42 - ReadHeadTextWithBomDetection() is called for every text read. This is fine for the current use case, but consider caching the BOM detection result if the same file is read multiple times.

Test Coverage

Strengths

  • New test PsScript_Populates_ScriptLanguage_And_Cmdlets() validates the cmdlet extraction feature
  • Test uses realistic PowerShell script sample

Gaps

  • No tests for new metadata APIs (InspectWithMetadata, ReadFileMetadata, CollectMetadata)
  • No tests for CompareDeclaredDetailed and dangerous alternative detection
  • No tests for TopTokens extraction (though disabled by default)
  • No tests for FileSystemMetadata option combinations
  • No tests for new helper methods in FileInspector.TextSubtype.cs

Recommendation: Add test coverage for the public APIs introduced in this PR.

Additional Observations

1. Version Bump

Version incremented from 1.0.5 to 1.0.6 across all projects - appropriate for a feature addition.

2. Documentation

XML comments are comprehensive and helpful. Consider adding a CHANGELOG.md entry for this release.

3. Backwards Compatibility

All changes appear to be additive (new properties, new methods) with no breaking changes. Existing code should continue to work.

4. Settings Proliferation

The Settings class now has 100+ configuration options. Consider:

  • Grouping related settings into nested configuration objects
  • Or documenting recommended preset configurations for common scenarios

Verdict

Recommendation: Approve with Minor Revisions

This is a solid PR that adds valuable functionality with good code quality. The issues identified are minor and can be addressed in follow-up commits or the next PR.

Must Fix Before Merge

None - no critical issues found.

Should Fix Soon

  1. Add test coverage for new public APIs
  2. Review exception handling strategy and add warning-level logging where appropriate
  3. Consider security implications of default IncludeMagicHeader = true
  4. Document collision resolution strategy in CollectMetadata

Nice to Have

  1. Make StrongCandidateScoreThreshold configurable
  2. Optimize token extraction for memory and GC pressure
  3. Add file read caching to reduce I/O
  4. Add CHANGELOG.md entry

Great work overall! The new metadata and text analysis capabilities will be very useful.

* Refactored `ReadHeadText` to cache results for improved performance.
* Updated logging levels for error handling in metadata and text reading.
* Enhanced `FileMetadataOptions` to default sensitive fields to false.
* Introduced `TopTokensRedactPatterns` for sensitive token redaction.
* Adjusted strong candidate score threshold to use settings for flexibility.
@claude

claude bot commented Dec 27, 2025

Pull Request Review: v1.0.6 - PowerShell Detection & Metadata Enhancements

Overview

This PR introduces significant enhancements to FileInspectorX v1.0.6, adding PowerShell script detection, file system metadata reading capabilities, top token extraction, and improved text subtype mapping. The changes span 26 files with 1,206 additions and 62 deletions.

Code Quality & Best Practices

Strengths

  1. Well-structured partial classes: Clean organization into FileInspector.Metadata.cs, FileInspector.TextMetrics.cs, FileInspector.TextReading.cs, etc.

  2. Comprehensive error handling: Excellent try-catch blocks with appropriate logging levels in FileInspector.Metadata.cs:35-42, FileInspector.TextMetrics.cs:52-58, SecurityHeuristics.Cmdlets.cs:60-66

  3. Strong documentation: Thorough XML comments include important security notes about sensitive data exposure

  4. Test coverage: Added PsScript_Populates_ScriptLanguage_And_Cmdlets test validates core functionality

  5. BOM detection: New ReadHeadTextWithBomDetection method properly handles UTF-8, UTF-16, and UTF-32 BOMs

Areas for Improvement

  1. Caching logic complexity (FileInspector.Analyze.cs:205-218): ReadHeadTextCached closure is clever but adds complexity - consider extracting to a helper class

  2. Hard-coded safety limits: TopTokensHardMaxUniqueTokens = 100,000 - consider making configurable and documenting rationale

  3. String allocations (FileInspector.TextMetrics.cs:79): Converting ReadOnlySpan to string in a tight loop could cause GC pressure

  4. Mixed logging patterns: Ensure consistent use of if (Settings.Logger.IsWarning) checks before logging

Performance Considerations

Optimizations

  • Excellent span-based parsing
  • Proper early bailouts when disabled
  • Configurable limits prevent runaway processing

Potential Issues

  • The dictionary could in principle grow beyond maxUnique, though the check at line 82 prevents this
  • Multiple ReadHeadText calls with different caps - consider reading once at max size
  • String interpolation in loop (minor, acceptable given bounded iteration)

Security Concerns

Security Strengths

  1. Redaction support: TopTokensRedactPatterns provides configurable filtering
  2. Opt-in sensitive features: Privacy-impacting features default to false
  3. Clear security warnings: XML docs note potential data exposure
  4. Bounded processing: Hard limits prevent resource exhaustion

Recommendations

  1. Token redaction uses simple IndexOf matching - consider regex for more sophisticated patterns (see the sketch after this list)
  2. Verify path disclosure when Exists = false is intentional
  3. Cmdlet extraction is safe (read-only) but consider obfuscation warnings
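
A small sketch of the regex variant from item 1 above, with illustrative patterns (System.Text.RegularExpressions) compiled once and reused:

// Sketch only: precompiled, case-insensitive regexes instead of plain IndexOf substring checks.
// The patterns shown are examples, not a vetted deny-list.
static readonly Regex[] RedactPatterns =
{
    new Regex("pass(word)?", RegexOptions.IgnoreCase | RegexOptions.Compiled),
    new Regex("api[_-]?key", RegexOptions.IgnoreCase | RegexOptions.Compiled),
    new Regex("secret|bearer", RegexOptions.IgnoreCase | RegexOptions.Compiled),
};

static bool IsSensitive(string token)
{
    foreach (var re in RedactPatterns)
        if (re.IsMatch(token)) return true;
    return false;
}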

Test Coverage

Good Coverage

  • New test validates PowerShell cmdlet extraction
  • Existing tests maintained

Suggested Additions

  1. Metadata tests for InspectWithMetadata, ReadFileMetadata, CollectMetadata
  2. Top tokens edge cases (empty files, stop words only, size limits, redaction)
  3. BOM detection tests for all BOM types
  4. Caching behavior tests
  5. Error path tests (files deleted mid-analysis)

Additional Observations

Positive Changes

  • Detection summary caching improves API ergonomics
  • PE kind detection (exe/dll/sys) is valuable
  • Centralized script language mapping is cleaner

Questions

  1. The PR title "Move tier bridge" doesn't match the changes - consider a more descriptive title
  2. The new FileAnalysis properties are additive (likely non-breaking) - can you confirm?
  3. The version bump from 1.0.5 to 1.0.6 is appropriate for feature additions

Summary

High-quality PR with well-thought-out features and good defensive coding. Main improvements needed:

  1. Add comprehensive unit tests for new features
  2. Performance optimizations for token dictionary and caching
  3. Clarify PR title
  4. Document hard-coded limit rationale

Recommendation: Approve with minor suggestions

Code is production-ready with good error handling, security awareness, and documentation. Suggested improvements can be addressed in follow-up work.


Review generated with Claude Code

@PrzemyslawKlys PrzemyslawKlys merged commit 3e07c6b into master Dec 27, 2025
1 check passed
@PrzemyslawKlys PrzemyslawKlys deleted the MoveTierBridge branch December 27, 2025 16:01