fix: handle Nirjas extraction errors and optimize file fallback #114#120

Open
Abh-igyan wants to merge 2 commits into fossology:master from Abh-igyan:master

Abh-igyan commented Mar 23, 2026

Description
Fixes a crash in the TF-IDF agent caused by unhandled exceptions during Nirjas comment extraction, and optimizes the file fallback mechanism by replacing inefficient in-memory reads with shutil.copyfileobj().
Changes

- Wrapped the commentExtract() call in a try/except block inside CommentPreprocessor.extract() to handle Nirjas parsing failures gracefully instead of crashing the entire scan
- Replaced the inFile.read().split('\n') pattern with shutil.copyfileobj() in the exception fallback path, which avoids loading the entire file into memory
- Applied the same shutil.copyfileobj() optimization to the unsupported file extension fallback (else block), which had the same inefficiency
- Added logging.warning() to surface extraction failures without terminating the scan
- Added shutil and logging to the imports
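
The pattern described in the list above could look roughly like the following. This is a minimal sketch, not the actual patch: the name extract_with_fallback and the extractor parameter are hypothetical, and extractor only stands in for Nirjas's commentExtract(), which is assumed here to return plain text (the real return type may differ). It catches IndexError, the error named in the test expectations of this PR.

```python
import logging
import os
import shutil
import tempfile

logger = logging.getLogger(__name__)


def extract_with_fallback(input_path, extractor):
    """Write extracted comments to a temp file and return its path.

    `extractor` is a hypothetical stand-in for the comment extractor: it is
    assumed to take a file path and return the comment text as a string.
    If it raises IndexError, the raw file content is copied instead.
    """
    fd, output_path = tempfile.mkstemp()
    with os.fdopen(fd, "w", encoding="utf-8") as out_file:
        try:
            out_file.write(extractor(input_path))
        except IndexError as exc:
            # Surface the failure, then fall back to the unprocessed file.
            logger.warning("Comment extraction failed for %s: %s", input_path, exc)
            with open(input_path, "r", encoding="utf-8", errors="replace") as in_file:
                # Streams the file in chunks rather than reading it whole.
                shutil.copyfileobj(in_file, out_file)
    return output_path
```

Passing the extractor in as an argument keeps the sketch runnable without Nirjas installed; in the real code the call would presumably be made directly.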

shutil.copyfileobj() copies data in fixed-size chunks, so memory use stays bounded even for large files.
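
For intuition, a chunked copy behaves roughly like the loop below. This is an illustrative sketch rather than shutil's actual source; the name copy_in_chunks and the 64 KiB chunk size are assumptions chosen for the example.

```python
import io


def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy from one file-like object to another, one chunk at a time.

    Only `chunk_size` characters (or bytes) are held in memory at once,
    unlike src.read(), which loads the whole file.
    """
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read signals end of file
            break
        dst.write(chunk)


# Usage with in-memory streams; real code would pass open file objects.
source = io.StringIO("line one\nline two\n")
target = io.StringIO()
copy_in_chunks(source, target, chunk_size=4)
```

Unlike the read().split('\n') approach, which re-appends a newline to every piece, a chunked copy writes the content through unchanged.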

How to test

Install Poetry with the official installer via curl rather than apt, which ships an outdated version:

bash

curl -sSL https://install.python-poetry.org | python3 -

Clone the repository, change into it, and install dependencies:

bash

git clone https://github.com/fossology/atarashi
cd atarashi
poetry install
poetry run preprocess

Download BusyBox (the real-world project that triggers the crash):

bash

wget https://sources.buildroot.net/busybox/busybox-1.36.1.tar.bz2  # updated download link
tar -xf busybox-1.36.1.tar.bz2

Run the TF-IDF agent scan on the extracted source tree:

bash
poetry run atarashi -a tfidf -s CosineSim ./busybox-1.36.1
# Use "poetry run" rather than a bare "atarashi": the command is not on the global PATH; it lives inside the Poetry virtual environment built above.

Expected results:
The scan completes without raising IndexError: list index out of range
A WARNING log appears for any file Nirjas fails to parse
The scan continues and produces results for remaining files

Fixes #114

- Wrap commentExtract() call in try/except to prevent scan crashes
  when Nirjas fails to parse supported file types (fixes fossology#114)
- Replace inefficient read().split() pattern with shutil.copyfileobj()
  in both the exception fallback and unsupported file type paths
- Add logging.warning() to surface extraction failures without
  terminating the scan
hastagAB (Member) left a comment

Hey @Abh-igyan, the fix LGTM!

As a follow-up, it'd be nice to narrow the except Exception to IndexError.

But nothing blocking here. Thanks for the contribution! 🎉
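
To illustrate the narrowing suggested in this review, here is a small sketch (the function first_comment is hypothetical and not part of atarashi). Catching only IndexError handles the known failure while letting unrelated bugs, such as a TypeError, propagate instead of being silently swallowed.

```python
import logging

logger = logging.getLogger(__name__)


def first_comment(comments):
    """Return the first extracted comment, or '' if the list is empty."""
    try:
        return comments[0]
    except IndexError:
        # Only the expected failure is handled; other errors still raise.
        logger.warning("No comments extracted; falling back to empty text")
        return ""
```

With a broad except Exception, passing None by mistake would also return an empty string and hide the bug; with the narrow clause it raises TypeError.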

Abh-igyan (Author) commented Mar 25, 2026

Thanks for the review, @hastagAB! I have updated the except clause to catch IndexError specifically, as suggested. I will keep such details in mind from now on.
By the way, I have also expanded the PR description with step-by-step instructions for running and testing the change, since I ran into numerous issues while testing.

Abh-igyan (Author) commented Mar 26, 2026 via email

Development

Successfully merging this pull request may close these issues.

TF-IDF agent crashes on real-world projects due to unhandled comment extraction errors