
v1.5.0 — Per-File Intelligence (CSV, Duplicates, Top-N, Regex)


@PC5518 PC5518 released this 09 Apr 13:48
· 14 commits to main since this release
be38b35

The largest single feature release since v1.0.0. Five new opt-in capabilities, all running on the same single-pass traversal — no re-scanning, no behavioral change to existing code.

What's New

return_files=True — Per-file list in the result dict

The returned dict gains a "files" key: a Python list of dicts, one per scanned file.

result = anscom.scan("/project", return_files=True, silent=True)
for f in result["files"]:
    print(f["path"], f["size"], f["category"], f["mtime"])

Each entry has path, size, ext, category, mtime. Size and mtime come from the same stat / FindFirstFile call already being made — no extra syscall on Windows. On Linux, fstatat is called only when needed.

export_csv="inventory.csv" — UTF-8 CSV export

Per-file inventory with columns path,size,ext,category,mtime. RFC 4180-compliant quoting. Zero dependencies. Reuses the per-file array collected for return_files.

anscom.scan("/data", export_csv="inventory.csv", silent=True)

Pipe directly into pandas, openpyxl, or any standard CSV consumer:

import pandas as pd
df = pd.read_csv("inventory.csv")
df.to_excel("report.xlsx", index=False)
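The RFC 4180 quoting the exporter promises can be reproduced with Python's standard csv module. A minimal sketch (illustrative, not anscom's C writer) showing how fields containing commas or quotes are escaped:

```python
import csv
import io

# Rows shaped like the exporter's columns: path,size,ext,category,mtime
rows = [
    ["path", "size", "ext", "category", "mtime"],
    ['/data/report, "final".csv', 1024, ".csv", "document", 1712668800],
]

buf = io.StringIO()
# csv.writer quotes per RFC 4180 by default: a field containing the
# delimiter or a quote char is wrapped in double quotes, and embedded
# double quotes are doubled.
csv.writer(buf, lineterminator="\n").writerows(rows)
print(buf.getvalue())
```

Any RFC 4180-aware consumer (pandas, Excel, DuckDB) round-trips such a file losslessly.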

largest_n=N — Top-N largest files

Per-thread min-heap of capacity N. O(log N) cost per file, no extra pass, no full sort.

result = anscom.scan("/mnt/storage", largest_n=20, silent=True)
for f in result["largest_files"]:
    print(f"{f['size'] / 1024**3:.2f} GB  {f['path']}")

The printed report also gains a "TOP N LARGEST FILES" section.
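The bounded min-heap is a standard top-N technique; a pure-Python sketch of the idea (anscom does this in C, one heap per worker thread, merged after join):

```python
import heapq

def top_n_largest(files, n):
    """Keep the N largest (size, path) pairs with a min-heap of capacity N.

    The heap root is always the smallest of the current top N, so each new
    file costs O(log N): push while the heap is not full, otherwise replace
    the root only when the new file is larger. No full sort, no extra pass.
    """
    heap = []  # min-heap of (size, path)
    for size, path in files:
        if len(heap) < n:
            heapq.heappush(heap, (size, path))
        elif size > heap[0][0]:
            heapq.heapreplace(heap, (size, path))
    return sorted(heap, reverse=True)  # largest first

files = [(120, "a.bin"), (5, "b.txt"), (900, "c.iso"), (42, "d.log")]
print(top_n_largest(files, 2))  # [(900, 'c.iso'), (120, 'a.bin')]
```

Merging per-thread heaps is just running the same loop over the union of their contents, which is why no locking is needed during traversal.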

find_duplicates=True — CRC32-based duplicate detection

Two-phase: (1) sort by size — files with a unique size are skipped entirely with zero I/O. (2) For same-size groups of two or more files, read the first 4096 bytes of each and compute a CRC32.

result = anscom.scan("/media-library", find_duplicates=True, silent=True)
print(f"Duplicate groups: {len(result['duplicates'])}")

Combine with return_files=True to compute reclaimable space.
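The two-phase algorithm is easy to reproduce in pure Python. A sketch under the same assumptions (bucket by size, then CRC32 of the first 4 KiB; note that matching the first 4 KiB is a strong candidate signal, and a full-file comparison would still be needed to prove byte-identity):

```python
import os
import zlib
from collections import defaultdict

def find_duplicate_groups(paths):
    # Phase 1: bucket by size — a unique size means zero bytes read.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # Phase 2: for same-size groups of >= 2, CRC32 of the first 4096 bytes.
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_crc = defaultdict(list)
        for p in same_size:
            with open(p, "rb") as f:
                by_crc[zlib.crc32(f.read(4096))].append(p)
        groups.extend(g for g in by_crc.values() if len(g) > 1)
    return groups
```

Phase 1 is what keeps this cheap on real trees: on a typical filesystem most files have a unique size and never get opened.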

regex_filter="pattern" — Path pattern filter

Only count files whose full path matches a regex.

anscom.scan("/codebase", regex_filter=r"/tests/.*\.py$", silent=True)
  • Linux / macOS: Native POSIX regcomp(REG_EXTENDED | REG_NOSUB) + regexec; zero GIL acquisition, runs fully in C inside the worker threads.
  • Windows: Falls back to Python's re module. For large Windows scans, prefer the extensions whitelist for zero-GIL filtering.

Invalid patterns raise ValueError immediately — no scan is started.
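The fail-fast check amounts to compiling the pattern before any traversal starts. A hypothetical sketch of that validation using the re module (the Windows fallback path; not anscom's actual code):

```python
import re

def validate_regex_filter(pattern):
    """Compile the filter up front so a bad pattern fails before any I/O.

    Mirrors the documented behavior: an invalid pattern raises ValueError
    immediately and no scan is started.
    """
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"invalid regex_filter: {exc}") from exc

matcher = validate_regex_filter(r"/tests/.*\.py$")
print(bool(matcher.search("/codebase/tests/test_io.py")))  # True
```

Raising ValueError rather than letting re.error escape keeps the error type consistent across the POSIX and re backends.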

Performance

All five features are strictly opt-in. A plain anscom.scan(".") with no new parameters runs the identical hot path as v1.3.0 — no extra syscalls, no allocations per file, no behavioral change.

  • Per-thread FileInfo array pre-allocated at 65,536 entries — zero reallocations for typical scans.
  • fstatat on Linux called only when needed (two separate guards: one for type resolution, one for size/mtime collection).
  • Per-thread min-heap for largest_n — lock-free, merged after join.
  • Per-thread file arrays — lock-free, merged after join.

Migration from v1.3.0

No breaking changes. All v1.3.0 code runs unchanged on v1.5.0.

# v1.3.0 code — works identically on v1.5.0
result = anscom.scan("/data", silent=True, ignore_junk=True)

# v1.5.0 — opt into new features as needed
result = anscom.scan(
    "/data",
    silent=True,
    ignore_junk=True,
    return_files=True,
    largest_n=20,
    find_duplicates=True,
    export_csv="inventory.csv",
)

Bug Fixes

  • sorted_top paths are now strdup'd independently from global_heap — no lifetime overlap, no double-free.
  • fstatat on Linux is called only when min_size, return_files, export_csv, find_duplicates, or largest_n > 0 is active. Two separate guards for type resolution vs. size/mtime collection.
  • Full docstring on anscom.scan is now accessible via help(anscom.scan).

Removed

  • export_excel — was crashing on Windows due to an openpyxl Workbook.read_only exception. Use export_csv + pandas.to_excel() instead, which is faster, dependency-free at scan time, and works identically across platforms.

Full Example — Everything at Once

import anscom

result = anscom.scan(
    "/mnt/enterprise",
    max_depth=20,
    workers=32,
    ignore_junk=True,
    silent=True,
    return_files=True,
    largest_n=50,
    find_duplicates=True,
    export_json="audit.json",
    export_csv="inventory.csv",
    show_tree=True,
    export_tree="tree.txt",
)

print(f"Files        : {result['total_files']:,}")
print(f"Duration     : {result['duration_seconds']:.3f}s")
print(f"Dup groups   : {len(result['duplicates'])}")
print(f"Largest file : {result['largest_files'][0]['path']}")

One scan pass. Three output files (audit.json, inventory.csv, tree.txt). Full in-memory results.


Install: pip install --upgrade anscom
PyPI: https://pypi.org/project/anscom/1.5.0/
License: MIT