
v1.5.0 — Per-File Intelligence (CSV, Duplicates, Top-N, Regex)


@PC5518 PC5518 released this 09 Apr 13:48
· 14 commits to main since this release
be38b35

The largest single feature release since v1.0.0. Five new opt-in capabilities, all running on the same single-pass traversal — no re-scanning, no behavioral change to existing code.

What's New

return_files=True — Per-file list in the result dict

The returned dict gains a "files" key: a Python list of dicts, one per scanned file.

result = anscom.scan("/project", return_files=True, silent=True)
for f in result["files"]:
    print(f["path"], f["size"], f["category"], f["mtime"])

Each entry has path, size, ext, category, mtime. Size and mtime come from the same stat / FindFirstFile call already being made — no extra syscall on Windows. On Linux, fstatat is called only when needed.

export_csv="inventory.csv" — UTF-8 CSV export

Per-file inventory with columns path,size,ext,category,mtime. RFC 4180-compliant quoting. Zero dependencies. Reuses the per-file array collected for return_files.

anscom.scan("/data", export_csv="inventory.csv", silent=True)

Pipe directly into pandas, openpyxl, or any standard CSV consumer:

import pandas as pd
df = pd.read_csv("inventory.csv")
df.to_excel("report.xlsx", index=False)
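The RFC 4180 quoting the exporter promises can be reproduced with Python's standard csv module. A minimal sketch (illustrative, not anscom's C writer) showing how fields containing commas or quotes are escaped:

```python
import csv
import io

# Rows shaped like the exporter's columns: path,size,ext,category,mtime
rows = [
    ["path", "size", "ext", "category", "mtime"],
    ['/data/report, "final".csv', 1024, ".csv", "document", 1712668800],
]

buf = io.StringIO()
# csv.writer quotes per RFC 4180 by default: a field containing the
# delimiter or a quote char is wrapped in double quotes, and embedded
# double quotes are doubled.
csv.writer(buf, lineterminator="\n").writerows(rows)
print(buf.getvalue())
```

Any RFC 4180-aware consumer (pandas, Excel, DuckDB) round-trips such a file losslessly.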

largest_n=N — Top-N largest files

Per-thread min-heap of capacity N. O(log N) cost per file, no extra pass, no full sort.

result = anscom.scan("/mnt/storage", largest_n=20, silent=True)
for f in result["largest_files"]:
    print(f"{f['size'] / 1024**3:.2f} GB  {f['path']}")

The printed report also gains a "TOP N LARGEST FILES" section.
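The bounded min-heap is a standard top-N technique; a pure-Python sketch of the idea (anscom does this in C, one heap per worker thread, merged after join):

```python
import heapq

def top_n_largest(files, n):
    """Keep the N largest (size, path) pairs with a min-heap of capacity N.

    The heap root is always the smallest of the current top N, so each new
    file costs O(log N): push while the heap is not full, otherwise replace
    the root only when the new file is larger. No full sort, no extra pass.
    """
    heap = []  # min-heap of (size, path)
    for size, path in files:
        if len(heap) < n:
            heapq.heappush(heap, (size, path))
        elif size > heap[0][0]:
            heapq.heapreplace(heap, (size, path))
    return sorted(heap, reverse=True)  # largest first

files = [(120, "a.bin"), (5, "b.txt"), (900, "c.iso"), (42, "d.log")]
print(top_n_largest(files, 2))  # [(900, 'c.iso'), (120, 'a.bin')]
```

Merging per-thread heaps is just running the same loop over the union of their contents, which is why no locking is needed during traversal.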

find_duplicates=True — CRC32-based duplicate detection

Two-phase: (1) sort by size — files with a unique size are skipped entirely with zero I/O. (2) For same-size groups of two or more files, read the first 4096 bytes of each and compute a CRC32.

result = anscom.scan("/media-library", find_duplicates=True, silent=True)
print(f"Duplicate groups: {len(result['duplicates'])}")

Combine with return_files=True to compute reclaimable space.
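The two-phase algorithm is easy to reproduce in pure Python. A sketch under the same assumptions (bucket by size, then CRC32 of the first 4 KiB; note that matching the first 4 KiB is a strong candidate signal, and a full-file comparison would still be needed to prove byte-identity):

```python
import os
import zlib
from collections import defaultdict

def find_duplicate_groups(paths):
    # Phase 1: bucket by size — a unique size means zero bytes read.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # Phase 2: for same-size groups of >= 2, CRC32 of the first 4096 bytes.
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_crc = defaultdict(list)
        for p in same_size:
            with open(p, "rb") as f:
                by_crc[zlib.crc32(f.read(4096))].append(p)
        groups.extend(g for g in by_crc.values() if len(g) > 1)
    return groups
```

Phase 1 is what keeps this cheap on real trees: on a typical filesystem most files have a unique size and never get opened.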

regex_filter="pattern" — Path pattern filter

Only count files whose full path matches a regex.

anscom.scan("/codebase", regex_filter=r"/tests/.*\.py$", silent=True)
  • Linux / macOS: Native POSIX regcomp(REG_EXTENDED | REG_NOSUB) + regexec; zero GIL acquisition, runs fully in C inside the worker threads.
  • Windows: Falls back to Python's re module. For large Windows scans, prefer the extensions whitelist for zero-GIL filtering.

Invalid patterns raise ValueError immediately — no scan is started.
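The fail-fast check amounts to compiling the pattern before any traversal starts. A hypothetical sketch of that validation using the re module (the Windows fallback path; not anscom's actual code):

```python
import re

def validate_regex_filter(pattern):
    """Compile the filter up front so a bad pattern fails before any I/O.

    Mirrors the documented behavior: an invalid pattern raises ValueError
    immediately and no scan is started.
    """
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"invalid regex_filter: {exc}") from exc

matcher = validate_regex_filter(r"/tests/.*\.py$")
print(bool(matcher.search("/codebase/tests/test_io.py")))  # True
```

Raising ValueError rather than letting re.error escape keeps the error type consistent across the POSIX and re backends.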

Performance

All five features are strictly opt-in. A plain anscom.scan(".") with no new parameters runs the identical hot path as v1.3.0 — no extra syscalls, no allocations per file, no behavioral change.

  • Per-thread FileInfo array pre-allocated at 65,536 entries — zero reallocations for typical scans.
  • fstatat on Linux called only when needed (two separate guards: one for type resolution, one for size/mtime collection).
  • Per-thread min-heap for largest_n — lock-free, merged after join.
  • Per-thread file arrays — lock-free, merged after join.

Migration from v1.3.0

No breaking changes. All v1.3.0 code runs unchanged on v1.5.0.

# v1.3.0 code — works identically on v1.5.0
result = anscom.scan("/data", silent=True, ignore_junk=True)

# v1.5.0 — opt into new features as needed
result = anscom.scan(
    "/data",
    silent=True,
    ignore_junk=True,
    return_files=True,
    largest_n=20,
    find_duplicates=True,
    export_csv="inventory.csv",
)

Bug Fixes

  • sorted_top paths are now strdup'd independently from global_heap — no lifetime overlap, no double-free.
  • fstatat on Linux is called only when min_size, return_files, export_csv, find_duplicates, or largest_n > 0 is active. Two separate guards for type resolution vs. size/mtime collection.
  • Full docstring on anscom.scan is now accessible via help(anscom.scan).

Removed

  • export_excel — was crashing on Windows due to an openpyxl Workbook.read_only exception. Use export_csv + pandas.to_excel() instead, which is faster, dependency-free at scan time, and works identically across platforms.

Full Example — Everything at Once

import anscom

result = anscom.scan(
    "/mnt/enterprise",
    max_depth=20,
    workers=32,
    ignore_junk=True,
    silent=True,
    return_files=True,
    largest_n=50,
    find_duplicates=True,
    export_json="audit.json",
    export_csv="inventory.csv",
    show_tree=True,
    export_tree="tree.txt",
)

print(f"Files        : {result['total_files']:,}")
print(f"Duration     : {result['duration_seconds']:.3f}s")
print(f"Dup groups   : {len(result['duplicates'])}")
print(f"Largest file : {result['largest_files'][0]['path']}")

One scan pass. Three output files (audit.json, inventory.csv, tree.txt). Full in-memory results.


Install: pip install --upgrade anscom
PyPI: https://pypi.org/project/anscom/1.5.0/
License: MIT