Skip non-UTF-8 files on subsequent indexing runs #49

Open
raphaelsty wants to merge 2 commits into main from fix/skip-ignored-utf8-files

@raphaelsty
Collaborator

Summary

Files that fail to parse (e.g. files containing invalid UTF-8, such as some CUDA samples, or binary files) are now persisted in ignored_files in the index state. On subsequent runs these files are skipped silently, so the same warnings are no longer repeated.

Before

Every colgrep invocation logs warnings for the same binary files:

⚠️  Skipping .../boost/numeric/interval/hw_rounding.hpp (stream did not contain valid UTF-8)
⚠️  Skipping .../boost/numeric/interval/constants.hpp (stream did not contain valid UTF-8)
... (20+ lines every run)

After

Warning logged once on first encounter, then silently skipped on all future runs.

Test plan

  • cargo test -p colgrep — 49 tests pass
  • cargo check -p colgrep compiles
  • CI passes

Files that fail to parse (e.g. invalid UTF-8) are now added to
ignored_files in IndexState. On subsequent indexing runs, these
files are silently skipped without logging warnings, avoiding
repeated noise from binary/non-UTF-8 files in large codebases.
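
The skip logic described above could look roughly like the sketch below. It is not the code from this PR: only `IndexState` and `ignored_files` are names taken from the description, while `visit_file`, `Outcome`, and the choice of a `BTreeSet<PathBuf>` are assumptions made for illustration. Serializing the state to disk is omitted; a state value carried between calls stands in for the persisted index state.

```rust
use std::collections::BTreeSet;
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the index state; only the `ignored_files`
/// field name comes from the PR description.
#[derive(Debug, Default)]
struct IndexState {
    ignored_files: BTreeSet<PathBuf>,
}

/// What happened to a single file during an indexing run (illustrative only).
#[derive(Debug, PartialEq)]
enum Outcome {
    Indexed,
    SkippedWithWarning,
    SkippedSilently,
}

/// Assumed helper: try to read a file as UTF-8 text and remember failures.
fn visit_file(state: &mut IndexState, path: &Path) -> Outcome {
    // A file that already failed on an earlier run is skipped without logging.
    if state.ignored_files.contains(path) {
        return Outcome::SkippedSilently;
    }
    match fs::read_to_string(path) {
        // A real indexer would parse and index the contents here.
        Ok(_contents) => Outcome::Indexed,
        Err(err) => {
            // First encounter: warn once, then record the path so later
            // runs do not repeat the warning.
            eprintln!("⚠️  Skipping {} ({})", path.display(), err);
            state.ignored_files.insert(path.to_path_buf());
            Outcome::SkippedWithWarning
        }
    }
}
```

Calling `visit_file` twice on the same non-UTF-8 file with the same state should print the warning only on the first call, which mirrors the Before/After behaviour described in the summary.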