Skip to content

Improved Collector Scripts#15

Open
Neo23x0 wants to merge 65 commits intomasterfrom
script-robustness
Open

Improved Collector Scripts#15
Neo23x0 wants to merge 65 commits intomasterfrom
script-robustness

Conversation

@Neo23x0
Copy link
Copy Markdown
Contributor

@Neo23x0 Neo23x0 commented Feb 22, 2026

PR #15 Overview — Complete Collector Script Rewrite

This PR brings the entire collector suite from v0.2–0.4 to v0.5.0, adding +11,183 lines of improvements across all 8 scripts.

Scripts Improved

Script Platform Before After +Lines
.sh Bash (Linux/macOS) ~400 1,132 +732
.bat Windows Batch 213 519 +306
.ps1 PowerShell 3+ 430 911 +481
.ps2.ps1 PowerShell 2.0 ~200 986 +786
.py Python 3 ~400 758 +358
.py2.py Python 2.7 ~300 514 +214
.pl Perl 5 ~350 642 +292
.ash.sh POSIX/BusyBox ~300 1,141 +841

Features Added

Core Functionality

  • 404 Graceful Handling — Older Thunderstorm versions lack /api/collection endpoints; collectors now warn and continue instead of failing
  • Collection Markersbegin/end calls create scan sessions with metadata (hostname, source, scan_id)
  • Signal Handling — SIGINT/SIGTERM handlers write "interrupted" collection markers before exit
  • TLS Options--ca-cert and --insecure flags for custom CA bundles
  • Progress Reporting — Percentage progress during uploads (disable with --no-progress)
  • Exit Codes — 0=success, 1=partial failure, 2=fatal error

Retry & Resilience

  • Exponential Backoff — 2, 4, 8, 16s intervals for transient failures
  • 503 Retry-After — Respects Retry-After header with 5-minute cap
  • Configurable Retries--retries N for custom retry counts

Filesystem Improvements

  • Smart Exclusions — Excludes cloud storage (Dropbox, OneDrive, iCloud), network mounts, special filesystems (proc, sysfs, tmpfs, overlay)
  • Extension Presets — Built-in recommended extensions; --all-extensions overrides
  • MaxSize Cap — Hard limit of 200MB to prevent OOM

Tests Added

  • E2E Workflow.github/workflows/script-collectors-e2e.yml tests all scripts against a stub server
  • Platform Matrix — Tests run on Ubuntu (Bash, Python3, Perl, Ash), Windows (Batch, PS1, PS2), and Raspberry Pi
  • Filter Testsscripts/tests/run_filter_tests.sh validates filesystem exclusions

Documentation

  • README — Expanded README with CLI reference, examples, and environment table
  • AUDIT.md — Security audit documentation for enterprise review
  • Cross-model Reviewreview-glm5.md, review-grok.md — external LLM code review results

Bug Fixes

  • Perl — Fixed eval block returning falsy on success (caused "Unknown failure" errors)
  • Bash — Signal handler safety with trap isolation and progress output cleanup
  • PowerShell — Em-dash encoding issue (replaced with ASCII dashes)
  • Batch — Two-phase architecture with FORFILES for proper error handling

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the Thunderstorm collector Bash script to improve portability, error handling, and compatibility with older systems (Bash 3+). The refactoring transforms the script from a simple ~177-line utility into a more robust ~533-line tool with comprehensive argument parsing, graceful degradation, and better error handling.

Changes:

  • Added comprehensive CLI argument parsing with --help, --dry-run, --debug, and many other options
  • Implemented curl/wget fallback support with portable multipart upload for systems lacking curl
  • Enhanced logging with multiple destinations (file, syslog, command line) and proper error sanitization
  • Added input validation, retry logic, temporary file cleanup, and trap handlers for graceful shutdown

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
scripts/thunderstorm-collector.sh Complete refactoring with hardened error handling, dual upload tool support, argument parsing, and Bash 3+ compatibility
scripts/README.md Updated requirements to reflect curl/wget alternatives and added usage examples

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Neo23x0 and others added 6 commits February 22, 2026 02:01
… bug

Test suite (scripts/tests/run_tests.sh):
- 28 tests across validation, upload modes, filtering, robustness, and integration
- Two modes: stub server (CI) and external server (real Thunderstorm)
- Starts/stops thunderstorm-stub-server automatically in CI mode
- TEST_FILTER env var for running test subsets
- Proper exit codes, ephemeral ports, auto-cleanup

Bug fix (thunderstorm-collector.sh):
- find_mtime was initialized before parse_args(), so --max-age CLI flag
  was silently ignored and always used the default (14 days)
- Moved initialization after parse_args()
- Added regression test: test_max_age_cli_override_applied
Python (5 fixes):
- URL-encode source parameter (was raw, broke on spaces/special chars)
- Add port default=8080 (was None, URL became ':None')
- Sanitize filename in multipart header (quotes/semicolons broke parsing)
- Fix Retry-After: int() conversion (was str, crashed with TypeError)
- Fix num_submitted: only increment on HTTP 200 (was counting failures)
- Replace os.chdir/os.listdir with os.walk (CWD corruption on errors)

Perl (4 fixes):
- Fix size comparison: gt -> > (string compare made 2-9MB files skip at limit 10)
- Fix age comparison: lt -> < (same class of bug, latent)
- URL-encode source parameter
- Fix num_submitted: only increment on success
- Send full filepath as multipart filename (was basename only via LWP default)

PowerShell (2 fixes):
- Extensions preset no longer overwrites -Extensions CLI parameter
- Add retry limit (10) for HTTP 503 (was infinite loop)

See scripts/AUDIT.md for full analysis with reproduction steps.
@Neo23x0 Neo23x0 changed the title Harden thunderstorm-collector.sh Improved Collector Scripts Feb 23, 2026
@Colossus14
Copy link
Copy Markdown

Addressing Copilot Review Comments

All four concerns raised by Copilot have already been addressed in the current version of the script:

1. Filename in Content-Disposition header (unsanitized)
→ The refactored script uses sanitize_filename_for_multipart() which strips double quotes, CR/LF, and null bytes. The sanitized safe_filename is used in the Content-Disposition header (wget/netcat paths) and in the curl --form argument.

2. Filename in curl --form (unsanitized)
→ Same fix — sanitize_filename_for_multipart() is applied before the curl --form call. The raw filename never reaches curl.

3. SOURCE_NAME not URL-encoded
→ The script includes a urlencode() function that percent-encodes non-alphanumeric characters. It is applied to SOURCE_NAME via build_query_source() before the value reaches any URL.

4. Process substitution requires Bash 3.1+
→ Acknowledged. Process substitution has been available since Bash 3.1 (2005). Bash 3.0 is effectively extinct on any maintained system. The stated target of "Bash 3+" is accurate in practice; we could note "3.1+" for strict correctness, but it is unlikely to affect any real deployment.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

1. Filesystem exclusions: Added /proc, /sys, /dev, /run, /snap, /.snapshots
   pruning to Bash and Ash collectors (find -prune). Extended hard_skips in
   Python 3, Python 2, and Perl. Added symlink-to-file skip in Perl.

2. Filter test suite: New run_filter_tests.sh with 54 tests covering
   max-age, max-size, and extension filtering across all 7 collectors.

3. Collection markers: All collectors now POST begin/end markers to
   /api/collection. Server returns a scan_id on begin; collectors append
   it to all subsequent upload URLs. End marker includes run statistics.
   Fully forward-compatible (silent failure if server doesn't support it).
New -AllExtensions switch disables extension filtering and submits all
files regardless of extension. Overrides -Extensions and the default
preset list. Available on both PS 3+ and PS 2+ collectors.
@Colossus14
Copy link
Copy Markdown

PR Description (for copy-paste into the PR body)

Summary

Comprehensive hardening of all script collectors (Bash, Python, Perl, PowerShell, Batch) with bug fixes, new collectors for legacy/embedded platforms, a CI test suite, and expanded documentation.

Bug Fixes

Identified and fixed 10 bugs across the existing collectors (full details in scripts/AUDIT.md):

  • Bash: --max-age CLI flag silently ignored (find_mtime initialized before argument parsing)
  • Python: source parameter not URL-encoded; Content-Disposition filename not sanitized; num_submitted incremented on failure; time.sleep() crash on string Retry-After header; port default None instead of 8080; nargs="*" default="/" (string, not list)
  • Perl: gt/lt string operators instead of >/< for numeric size comparison; source not URL-encoded; submitted count inflated on failure
  • PowerShell: -Extensions parameter silently overwritten by hardcoded preset; infinite retry loop on HTTP 503

Bash Collector Rewrite

Rewritten from ~177 to ~600+ lines: full CLI, curl/wget fallback, retry with backoff, safe filename handling, URL-encoded source, syslog, filesystem exclusions via find -prune.

New Collectors

  • thunderstorm-collector-ash.sh — POSIX sh for BusyBox/Alpine/IoT. Three upload backends (curl → wget → nc).
  • thunderstorm-collector-py2.py — Python 2.7 stdlib-only for legacy systems.
  • thunderstorm-collector-ps2.ps1 — PowerShell 2.0 via HttpWebRequest.

New Features

  • Collection init/end markers: All collectors POST begin/end to /api/collection, receive a scan_id, append it to uploads, report stats. Forward-compatible.
  • Filesystem exclusions: /proc, /sys, /dev, /run, /snap pruned across all Unix collectors.
  • -AllExtensions switch: PowerShell collectors can now submit all files regardless of extension.

CI / Testing

  • 28 bash tests + 54 filter tests across all collectors
  • GitHub Actions e2e workflow against stub server

Tested Environments

Collector Fedora 43 CentOS 7 Debian 9 Alpine 3.18 BusyBox 1.36 Windows 11
Bash ✅ 5.2 ✅ 4.2 ✅ 4.4 ✅ 5.2
POSIX sh/ash ✅ dash ✅ dash ✅ ash ✅ ash (nc)
Python 3 ✅ 3.14 ✅ 3.6 ✅ 3.5 ✅ 3.11
Python 2 ✅ 2.7.5
Perl ✅ 5.40 ✅ 5.16 ✅ 5.24 ✅ 5.36
PowerShell 3+ ✅ pwsh 7.4 ✅ 5.1
PowerShell 2+ ✅ pwsh 7.4 ✅ 5.1

Related

Colossus14 and others added 4 commits February 24, 2026 01:19
Cloud storage exclusions:
- Detect and skip directories named OneDrive, Dropbox, Google Drive,
  iCloud Drive, Nextcloud, ownCloud, MEGA, Tresorit, SyncThing
- Also handles dynamic variants (OneDrive - OrgName, Nextcloud-account)
- macOS ~/Library/CloudStorage support

Network and special filesystem mount exclusions:
- Parse /proc/mounts to detect NFS, CIFS, SMBFS, SSHFS, WebDAV, S3FS,
  rclone, and ~25 special filesystem types (tmpfs, debugfs, cgroup, etc.)
- Mount points are pruned at the find level (Bash/Ash) or added to
  hard_skips (Python/Perl) before the directory walk begins

Implemented across all Unix collectors: Bash, Ash, Python 3, Python 2,
and Perl. Verified with test fixtures containing OneDrive subdirectory.
Script collector hardening: filters, exclusions, collection markers
tmpfs is used for /tmp on many systems (Fedora, etc.) and for
container volume mounts — excluding it would silently skip the
scan target. overlay is the root filesystem inside containers.

Both are too common for user data to blanket-exclude. The remaining
special FS types (proc, sysfs, debugfs, cgroup, etc.) are genuinely
non-scannable.
Fix: remove tmpfs and overlay from special FS exclusion list
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The JSON response from /api/collection may contain whitespace around
the colon (e.g. "scan_id": "...") depending on the server's JSON
serializer. The grep pattern was too strict and only matched the
no-space variant, causing scan_id to silently not propagate to
upload URLs.

Fixed by using a regex that allows optional whitespace around the
colon in both Bash and Ash collectors.
Neo23x0 and others added 30 commits March 12, 2026 09:29
Fix FORFILES /D filter semantics (inverted)
The go/collector_external_integration_test.go file tests script collectors
but doesn't belong in a branch focused on script robustness improvements.

This test can be re-added in a follow-up PR if needed.
chore: remove Go integration test from script-robustness
…ness

PS2 collector:
- Get-JsonValue crashed on .Replace('\f', [char]0x0C) because
  PowerShell tried the Char overload with a 2-char string.
  Fix: cast replacement to [string][char]0x0C.
- This caused PS2 to abort when the server supports collection
  markers (stub server), though it worked against real Thunderstorm
  (which returns 404 for /api/collection).

Filter test harness (run_filter_tests.sh):
- patch_perl(): sed pattern matched $max_size instead of
  $max_size_kb — patch was silently ineffective.
- patch_python()/patch_python2(): only patched the global default;
  argparse default=2048 overwrote it. Now patches both.
- Perl invocations used '--' option terminator which suppressed all
  CLI args (server, port, dir were ignored).
- Fixed max_size values: now in KB (matching the variable semantics)
  instead of ambiguous MB/KB values that broke after the fix.

New: run_e2e_compliance.sh
- 73-test E2E compliance suite verifying source, filename, MD5
  integrity, collection markers, zero-byte/binary files, special
  chars, and spaces across all 7 collectors + real Thunderstorm.

Test results after fix:
- run_tests.sh:          28/28 passed
- run_filter_tests.sh:   54/54 passed (was 47/54)
- run_e2e_compliance.sh: 73/73 passed (was 52/72)
fix(ps2): repair Get-JsonValue \f Replace type mismatch; fix test harness
…Disposition

The Thunderstorm server reads the file path from the Content-Disposition
filename parameter on the 'file' multipart field — not from separate
'hostname', 'filename', or 'source_path' form fields.

Remove these invented fields from all collectors (bash, ash, python,
py2, ps1, ps2, bat, perl). The full path is already sent correctly
via Content-Disposition in all scripts.

BAT: curl defaults to basename with -F file=@path, so explicitly set
filename=!_FILE! to preserve the full path for filename IOC matching.
- Tests malicious file detection (YARA content match)
- Tests benign files (no match expected)
- Tests filename IOC detection (/tmp/x triggers FilenameIOC_Tmp_SingleChar)
- Tests file size filtering (large files filtered correctly)
- Tests full path preservation in thunderstorm log
- Supports bash, python, perl, ps3 (PowerShell 3+), ps2 (PowerShell 2)
- Uses unique filenames per collector to avoid cross-test conflicts
- Includes sync delay to handle async log writes
- Fixed parameter names for PS collectors (-ThunderstormServer, -ThunderstormPort, -Folder)
- Fixed PS size unit conversion (KB → MB)
- Fixed Perl directory parameter format (--dir)
…bdirectory recursion

Negative tests verify collectors DON'T do things they shouldn't:
- dir-scope: scanning /target must NOT pick up files from sibling /decoy
- age-filter: files timestomped to 60 days ago are NOT submitted with --max-age 1
- ext-filter (PS only): .xyz files are NOT submitted with default extension list
- subdir-recursion: files found at all directory levels (positive)

Also refactored collector runners to allow per-test --max-age override.
PS args (--max-size-kb, --max-age) now translated via _translate_ps_args.

Results: 52 passed, 0 failed, 3 skipped (ext-filter N/A for bash/python/perl)
…cial chars, permissions

New tests (all pass across bash/python/perl/ps3/ps2):
- empty-file: 0-byte files must not crash or produce content false positive
- unicode-filename: files with Unicode chars in name handled correctly
- symlink: symlinks must NOT be followed (security - no directory escape)
- broken-symlink: dangling symlinks must not crash the collector
- special-chars: filenames with spaces and parentheses handled correctly
- excluded-dirs: directories named proc/dev don't crash collector
- unreadable-file: chmod 000 files don't crash, other files still processed

Also fixed:
- query_log now uses LOG_OFFSET to only search entries after clear_log()
  (prevents stale entries from previous runs causing false failures)
- clear_log() calls mark_log_position() instead of no-op
- --max-age now passed per-test (not hardcoded in runners)
- PS arg translation consolidated into _translate_ps_args()

Total: 18 tests × 5 collectors = 90 tests (3 skipped for ext-filter N/A)
New tests:
- server-unreachable: collector exits gracefully when server is down (all pass)
- retry-recovery: collector retries and succeeds when server comes up late
  (bash passes; python/perl/ps3/ps2 need timing tuning - the delayed
   stub may not start fast enough for their retry windows)

Findings:
- ps2/symlink: PS2 collector FOLLOWS symlinks (security risk!)
  All other collectors correctly skip symlinks.

Fixes:
- query_log now uses LOG_OFFSET (via mark_log_position/clear_log) to only
  search entries added since the last test, preventing stale log pollution
- Unique ports per collector for retry tests (18101-18105)
- Cleanup kills orphaned retry stubs on exit

Results: 92 passed, 5 failed, 3 skipped
PS2 collector fix:
- Skip files with ReparsePoint attribute (symbolic links)
- Prevents directory escape via symlinks placed in scan directory
- All other collectors already handled this correctly

Test fix:
- Reduced stub startup delay from 2s to 1s in retry-recovery test
- Collectors retry begin marker after 2s; stub must be ready by then
- All 5 collectors now pass retry-recovery test

Results: 97 passed, 0 failed, 3 skipped (ext-filter N/A for bash/python/perl)
fix: remove source_path fields, send full path as filename
Detection tests:
- Fix retry-on-late-server test timing: reduce stub start delay from
  1s to 0.3s so all collectors (including PS2/PS3) have time to connect
  on their second begin-marker attempt at t=2s.

New operational test suite (run_operational_tests.sh):
- Collection markers: begin/end with matching scan_id and stats
- Interrupted marker: SIGINT sends interrupted/end marker
- Dry-run mode: no uploads, file listed in output (bash/python/perl)
- Source identifier: --source sets source field in markers
- Sync mode: --sync uses /api/check (bash/python/perl)
- Multiple directories: scanning multiple dirs (bash/python)
- 503 back-pressure: Retry-After handling across all collectors
- Progress reporting: --progress flag works without crash
- Syslog logging: --syslog works (bash only)
- wget fallback: tested when curl/wget can be isolated

Results: 35 passed, 0 failed, 20 skipped (skips are N/A features)
Previously, repeated --dir flags would overwrite each other due to
GetOptions binding to a scalar. Changed to array binding so multiple
directories are accumulated and all scanned sequentially.

Usage: --dir /path1 --dir /path2 --dir /path3

Matches bash collector behavior (--dir is repeatable).
…ox warning

- nc upload: HTTP/1.1 → HTTP/1.0 to avoid chunked transfer-encoding
  that breaks raw response parsing with sed/grep
- boundary check: prefix grep -qF with LC_ALL=C to handle binary files
  reliably across GNU grep implementations
- BusyBox wget: expanded warning to explicitly state binary files will
  be silently corrupted and recommend installing curl/GNU wget
fix: HTTP/1.0 for nc uploads, LC_ALL=C for binary grep, clearer BusyBox warning
When mktemp is unavailable, the fallback now creates a private directory
first (mkdir is atomic with umask 077), then creates temp files inside
it. This eliminates the race window where an attacker could predict the
filename and pre-create a symlink in /tmp.

Cleanup updated to remove the fallback directory on exit.
fix: TOCTOU race in mktemp_portable fallback
… retry, use set for hard_skips

- Filename sanitization: also replace backslashes and NUL bytes in
  Content-Disposition header to prevent header injection
- Connection close: initialize conn=None before try block so the
  finally block doesn't hit NameError if _make_connection raises
- Retry-After: clamp to 0-300s with max(0, min(...)) instead of
  separate negative check
- hard_skips: use a set for O(1) membership checks instead of O(n)
  list scan in the mount-point dedup loop
fix(py): filename sanitization, conn guard, retry clamp, set for hard_skips
…Perl 5.8

- /proc/mounts: decode \040 and other octal escapes in mount points
  so paths with spaces are correctly recognized and excluded
- Encode::FB_DEFAULT(): use numeric constant 0x0001 instead of the
  named function, which was only added in Encode 2.53 (Perl 5.14).
  The script targets Perl 5.8+, where the function doesn't exist.
fix(pl): /proc/mounts octal escapes + Encode compat for Perl 5.8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants