Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens the Thunderstorm collector Bash script to improve portability, error handling, and compatibility with older systems (Bash 3+). The refactoring transforms the script from a simple ~177-line utility into a more robust ~533-line tool with comprehensive argument parsing, graceful degradation, and better error handling.
Changes:
- Added comprehensive CLI argument parsing with --help, --dry-run, --debug, and many other options
- Implemented curl/wget fallback support with portable multipart upload for systems lacking curl
- Enhanced logging with multiple destinations (file, syslog, command line) and proper error sanitization
- Added input validation, retry logic, temporary file cleanup, and trap handlers for graceful shutdown
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scripts/thunderstorm-collector.sh | Complete refactoring with hardened error handling, dual upload tool support, argument parsing, and Bash 3+ compatibility |
| scripts/README.md | Updated requirements to reflect curl/wget alternatives and added usage examples |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… bug Test suite (scripts/tests/run_tests.sh): - 28 tests across validation, upload modes, filtering, robustness, and integration - Two modes: stub server (CI) and external server (real Thunderstorm) - Starts/stops thunderstorm-stub-server automatically in CI mode - TEST_FILTER env var for running test subsets - Proper exit codes, ephemeral ports, auto-cleanup Bug fix (thunderstorm-collector.sh): - find_mtime was initialized before parse_args(), so --max-age CLI flag was silently ignored and always used the default (14 days) - Moved initialization after parse_args() - Added regression test: test_max_age_cli_override_applied
Python (5 fixes): - URL-encode source parameter (was raw, broke on spaces/special chars) - Add port default=8080 (was None, URL became ':None') - Sanitize filename in multipart header (quotes/semicolons broke parsing) - Fix Retry-After: int() conversion (was str, crashed with TypeError) - Fix num_submitted: only increment on HTTP 200 (was counting failures) - Replace os.chdir/os.listdir with os.walk (CWD corruption on errors) Perl (4 fixes): - Fix size comparison: gt -> > (string compare made 2-9MB files skip at limit 10) - Fix age comparison: lt -> < (same class of bug, latent) - URL-encode source parameter - Fix num_submitted: only increment on success - Send full filepath as multipart filename (was basename only via LWP default) PowerShell (2 fixes): - Extensions preset no longer overwrites -Extensions CLI parameter - Add retry limit (10) for HTTP 503 (was infinite loop) See scripts/AUDIT.md for full analysis with reproduction steps.
Addressing Copilot Review CommentsAll four concerns raised by Copilot have already been addressed in the current version of the script: 1. Filename in Content-Disposition header (unsanitized) 2. Filename in curl --form (unsanitized) 3. SOURCE_NAME not URL-encoded 4. Process substitution requires Bash 3.1+ |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
1. Filesystem exclusions: Added /proc, /sys, /dev, /run, /snap, /.snapshots pruning to Bash and Ash collectors (find -prune). Extended hard_skips in Python 3, Python 2, and Perl. Added symlink-to-file skip in Perl. 2. Filter test suite: New run_filter_tests.sh with 54 tests covering max-age, max-size, and extension filtering across all 7 collectors. 3. Collection markers: All collectors now POST begin/end markers to /api/collection. Server returns a scan_id on begin; collectors append it to all subsequent upload URLs. End marker includes run statistics. Fully forward-compatible (silent failure if server doesn't support it).
New -AllExtensions switch disables extension filtering and submits all files regardless of extension. Overrides -Extensions and the default preset list. Available on both PS 3+ and PS 2+ collectors.
PR Description (for copy-paste into the PR body)SummaryComprehensive hardening of all script collectors (Bash, Python, Perl, PowerShell, Batch) with bug fixes, new collectors for legacy/embedded platforms, a CI test suite, and expanded documentation. Bug FixesIdentified and fixed 10 bugs across the existing collectors (full details in
Bash Collector RewriteRewritten from ~177 to ~600+ lines: full CLI, curl/wget fallback, retry with backoff, safe filename handling, URL-encoded source, syslog, filesystem exclusions via New Collectors
New Features
CI / Testing
Tested Environments
Related |
Cloud storage exclusions: - Detect and skip directories named OneDrive, Dropbox, Google Drive, iCloud Drive, Nextcloud, ownCloud, MEGA, Tresorit, SyncThing - Also handles dynamic variants (OneDrive - OrgName, Nextcloud-account) - macOS ~/Library/CloudStorage support Network and special filesystem mount exclusions: - Parse /proc/mounts to detect NFS, CIFS, SMBFS, SSHFS, WebDAV, S3FS, rclone, and ~25 special filesystem types (tmpfs, debugfs, cgroup, etc.) - Mount points are pruned at the find level (Bash/Ash) or added to hard_skips (Python/Perl) before the directory walk begins Implemented across all Unix collectors: Bash, Ash, Python 3, Python 2, and Perl. Verified with test fixtures containing OneDrive subdirectory.
Script collector hardening: filters, exclusions, collection markers
tmpfs is used for /tmp on many systems (Fedora, etc.) and for container volume mounts — excluding it would silently skip the scan target. overlay is the root filesystem inside containers. Both are too common for user data to blanket-exclude. The remaining special FS types (proc, sysfs, debugfs, cgroup, etc.) are genuinely non-scannable.
Fix: remove tmpfs and overlay from special FS exclusion list
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 17 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The JSON response from /api/collection may contain whitespace around the colon (e.g. "scan_id": "...") depending on the server's JSON serializer. The grep pattern was too strict and only matched the no-space variant, causing scan_id to silently not propagate to upload URLs. Fixed by using a regex that allows optional whitespace around the colon in both Bash and Ash collectors.
Fix FORFILES /D filter semantics (inverted)
The go/collector_external_integration_test.go file tests script collectors but doesn't belong in a branch focused on script robustness improvements. This test can be re-added in a follow-up PR if needed.
chore: remove Go integration test from script-robustness
…ness
PS2 collector:
- Get-JsonValue crashed on .Replace('\f', [char]0x0C) because
PowerShell tried the Char overload with a 2-char string.
Fix: cast replacement to [string][char]0x0C.
- This caused PS2 to abort when the server supports collection
markers (stub server), though it worked against real Thunderstorm
(which returns 404 for /api/collection).
Filter test harness (run_filter_tests.sh):
- patch_perl(): sed pattern matched $max_size instead of
$max_size_kb — patch was silently ineffective.
- patch_python()/patch_python2(): only patched the global default;
argparse default=2048 overwrote it. Now patches both.
- Perl invocations used '--' option terminator which suppressed all
CLI args (server, port, dir were ignored).
- Fixed max_size values: now in KB (matching the variable semantics)
instead of ambiguous MB/KB values that broke after the fix.
New: run_e2e_compliance.sh
- 73-test E2E compliance suite verifying source, filename, MD5
integrity, collection markers, zero-byte/binary files, special
chars, and spaces across all 7 collectors + real Thunderstorm.
Test results after fix:
- run_tests.sh: 28/28 passed
- run_filter_tests.sh: 54/54 passed (was 47/54)
- run_e2e_compliance.sh: 73/73 passed (was 52/72)
fix(ps2): repair Get-JsonValue \f Replace type mismatch; fix test harness
…Disposition The Thunderstorm server reads the file path from the Content-Disposition filename parameter on the 'file' multipart field — not from separate 'hostname', 'filename', or 'source_path' form fields. Remove these invented fields from all collectors (bash, ash, python, py2, ps1, ps2, bat, perl). The full path is already sent correctly via Content-Disposition in all scripts. BAT: curl defaults to basename with -F file=@path, so explicitly set filename=!_FILE! to preserve the full path for filename IOC matching.
- Tests malicious file detection (YARA content match) - Tests benign files (no match expected) - Tests filename IOC detection (/tmp/x triggers FilenameIOC_Tmp_SingleChar) - Tests file size filtering (large files filtered correctly) - Tests full path preservation in thunderstorm log - Supports bash, python, perl, ps3 (PowerShell 3+), ps2 (PowerShell 2) - Uses unique filenames per collector to avoid cross-test conflicts - Includes sync delay to handle async log writes - Fixed parameter names for PS collectors (-ThunderstormServer, -ThunderstormPort, -Folder) - Fixed PS size unit conversion (KB → MB) - Fixed Perl directory parameter format (--dir)
…bdirectory recursion Negative tests verify collectors DON'T do things they shouldn't: - dir-scope: scanning /target must NOT pick up files from sibling /decoy - age-filter: files timestomped to 60 days ago are NOT submitted with --max-age 1 - ext-filter (PS only): .xyz files are NOT submitted with default extension list - subdir-recursion: files found at all directory levels (positive) Also refactored collector runners to allow per-test --max-age override. PS args (--max-size-kb, --max-age) now translated via _translate_ps_args. Results: 52 passed, 0 failed, 3 skipped (ext-filter N/A for bash/python/perl)
…cial chars, permissions New tests (all pass across bash/python/perl/ps3/ps2): - empty-file: 0-byte files must not crash or produce content false positive - unicode-filename: files with Unicode chars in name handled correctly - symlink: symlinks must NOT be followed (security - no directory escape) - broken-symlink: dangling symlinks must not crash the collector - special-chars: filenames with spaces and parentheses handled correctly - excluded-dirs: directories named proc/dev don't crash collector - unreadable-file: chmod 000 files don't crash, other files still processed Also fixed: - query_log now uses LOG_OFFSET to only search entries after clear_log() (prevents stale entries from previous runs causing false failures) - clear_log() calls mark_log_position() instead of no-op - --max-age now passed per-test (not hardcoded in runners) - PS arg translation consolidated into _translate_ps_args() Total: 18 tests × 5 collectors = 90 tests (3 skipped for ext-filter N/A)
New tests: - server-unreachable: collector exits gracefully when server is down (all pass) - retry-recovery: collector retries and succeeds when server comes up late (bash passes; python/perl/ps3/ps2 need timing tuning - the delayed stub may not start fast enough for their retry windows) Findings: - ps2/symlink: PS2 collector FOLLOWS symlinks (security risk!) All other collectors correctly skip symlinks. Fixes: - query_log now uses LOG_OFFSET (via mark_log_position/clear_log) to only search entries added since the last test, preventing stale log pollution - Unique ports per collector for retry tests (18101-18105) - Cleanup kills orphaned retry stubs on exit Results: 92 passed, 5 failed, 3 skipped
PS2 collector fix: - Skip files with ReparsePoint attribute (symbolic links) - Prevents directory escape via symlinks placed in scan directory - All other collectors already handled this correctly Test fix: - Reduced stub startup delay from 2s to 1s in retry-recovery test - Collectors retry begin marker after 2s; stub must be ready by then - All 5 collectors now pass retry-recovery test Results: 97 passed, 0 failed, 3 skipped (ext-filter N/A for bash/python/perl)
fix: remove source_path fields, send full path as filename
Detection tests: - Fix retry-on-late-server test timing: reduce stub start delay from 1s to 0.3s so all collectors (including PS2/PS3) have time to connect on their second begin-marker attempt at t=2s. New operational test suite (run_operational_tests.sh): - Collection markers: begin/end with matching scan_id and stats - Interrupted marker: SIGINT sends interrupted/end marker - Dry-run mode: no uploads, file listed in output (bash/python/perl) - Source identifier: --source sets source field in markers - Sync mode: --sync uses /api/check (bash/python/perl) - Multiple directories: scanning multiple dirs (bash/python) - 503 back-pressure: Retry-After handling across all collectors - Progress reporting: --progress flag works without crash - Syslog logging: --syslog works (bash only) - wget fallback: tested when curl/wget can be isolated Results: 35 passed, 0 failed, 20 skipped (skips are N/A features)
Previously, repeated --dir flags would overwrite each other due to GetOptions binding to a scalar. Changed to array binding so multiple directories are accumulated and all scanned sequentially. Usage: --dir /path1 --dir /path2 --dir /path3 Matches bash collector behavior (--dir is repeatable).
…ox warning - nc upload: HTTP/1.1 → HTTP/1.0 to avoid chunked transfer-encoding that breaks raw response parsing with sed/grep - boundary check: prefix grep -qF with LC_ALL=C to handle binary files reliably across GNU grep implementations - BusyBox wget: expanded warning to explicitly state binary files will be silently corrupted and recommend installing curl/GNU wget
fix: HTTP/1.0 for nc uploads, LC_ALL=C for binary grep, clearer BusyBox warning
When mktemp is unavailable, the fallback now creates a private directory first (mkdir is atomic with umask 077), then creates temp files inside it. This eliminates the race window where an attacker could predict the filename and pre-create a symlink in /tmp. Cleanup updated to remove the fallback directory on exit.
fix: TOCTOU race in mktemp_portable fallback
… retry, use set for hard_skips - Filename sanitization: also replace backslashes and NUL bytes in Content-Disposition header to prevent header injection - Connection close: initialize conn=None before try block so the finally block doesn't hit NameError if _make_connection raises - Retry-After: clamp to 0-300s with max(0, min(...)) instead of separate negative check - hard_skips: use a set for O(1) membership checks instead of O(n) list scan in the mount-point dedup loop
fix(py): filename sanitization, conn guard, retry clamp, set for hard_skips
…Perl 5.8 - /proc/mounts: decode \040 and other octal escapes in mount points so paths with spaces are correctly recognized and excluded - Encode::FB_DEFAULT(): use numeric constant 0x0001 instead of the named function, which was only added in Encode 2.53 (Perl 5.14). The script targets Perl 5.8+, where the function doesn't exist.
fix(pl): /proc/mounts octal escapes + Encode compat for Perl 5.8
PR #15 Overview — Complete Collector Script Rewrite
This PR brings the entire collector suite from v0.2–0.4 to v0.5.0, adding +11,183 lines of improvements across all 8 scripts.
Scripts Improved
.sh.bat.ps1.ps2.ps1.py.py2.py.pl.ash.shFeatures Added
Core Functionality
/api/collectionendpoints; collectors now warn and continue instead of failingbegin/endcalls create scan sessions with metadata (hostname, source, scan_id)--ca-certand--insecureflags for custom CA bundles--no-progress)Retry & Resilience
Retry-Afterheader with 5-minute cap--retries Nfor custom retry countsFilesystem Improvements
--all-extensionsoverridesTests Added
.github/workflows/script-collectors-e2e.ymltests all scripts against a stub serverscripts/tests/run_filter_tests.shvalidates filesystem exclusionsDocumentation
review-glm5.md,review-grok.md— external LLM code review resultsBug Fixes