fix: add try-except block to comment extraction to prevent TF-IDF crash (#114) #121

Open
DhruvrajSinhZala24 wants to merge 1 commit into fossology:master from DhruvrajSinhZala24:fix/issue-114

@DhruvrajSinhZala24

Fixes #114

Problem

When running the Atarashi TF-IDF agent on real-world source trees (e.g., BusyBox), the scan occasionally crashes because of unhandled exceptions raised by the nirjas comment extractor. The whole scanner then terminates prematurely, which makes FOSSology scans less reliable.

Reproduction Steps

  1. Download a real-world source tree like BusyBox.
  2. Run the TF-IDF agent on a file that contains malformed or complex comment structures (e.g., atarashi -a tfidf -s CosineSim path/to/file.c).
  3. Observe the crash: The scanner exits with an exception from nirjas instead of completing the scan.

Root Cause

The CommentPreprocessor.extract() method previously assumed that comment extraction would always succeed for files with supported extensions, so it used a plain if-else branch with no error handling. However, nirjas.extract() can raise exceptions on files with unexpected encodings, badly malformed comment blocks (e.g., unclosed comments), or rare syntax patterns. These unhandled exceptions propagate up and halt the entire Atarashi agent.

Fix Implemented

I have added a try...except block around the comment extraction phase in atarashi/libs/commentPreprocessor.py.

  • If nirjas.extract() fails, the exception is caught.
  • The scanner then falls back to reading the raw file contents, treating the whole file as potential license text.
  • This is the same fallback Atarashi already uses for unsupported file extensions, so the scan completes even when comment extraction fails.
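
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in the diff: the function name `extract_text_with_fallback` and the injected `comment_extractor` callable are hypothetical stand-ins for the nirjas call, and catching the broad `Exception` is an assumption, since the description does not list which exception types nirjas raises.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def extract_text_with_fallback(path: str, comment_extractor: Callable[[str], str]) -> str:
    """Return extracted comment text, or the raw file contents if extraction fails.

    `comment_extractor` is a hypothetical stand-in for the real extraction call.
    """
    try:
        return comment_extractor(path)
    except Exception as exc:
        # Broad catch is an assumption: the extractor's failure modes are not enumerated.
        logger.warning(
            "Comment extraction failed for %s (%s); falling back to raw contents",
            path,
            exc,
        )
        # errors="replace" keeps the fallback itself from failing on odd encodings.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            return handle.read()
```

Logging the failure before falling back keeps the scan running while still leaving a trace of which files could not be parsed.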

Testing Performed

  • Unit Test: Created a malformed C file with unclosed comments and verified that the scanner no longer crashes.
  • Mock Stress Test: Verified via a Python script that mocking a nirjas exception correctly triggers the fallback path, extracting all text from the raw file as expected.
  • Environment: Tested on Python 3.12/3.13.
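
A mock-based check along the lines of the second bullet might look like the sketch below. It is self-contained for illustration only: `read_with_fallback` is a hypothetical stand-in for the patched extraction step, and `ValueError` is an assumed exception type, not one confirmed to come from nirjas.

```python
import os
import tempfile
from unittest import mock


def read_with_fallback(path, extractor):
    # Hypothetical stand-in for the patched extraction step.
    try:
        return extractor(path)
    except Exception:
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            return handle.read()


def check_fallback_on_extractor_failure():
    # Malformed C source: the block comment is never closed.
    source = "/* SPDX-License-Identifier: GPL-2.0\nint main(void) { return 0; }\n"
    with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as tmp:
        tmp.write(source)
        path = tmp.name
    try:
        # The mock raises on every call, simulating an extractor failure.
        failing = mock.Mock(side_effect=ValueError("unclosed comment"))
        result = read_with_fallback(path, failing)
        failing.assert_called_once_with(path)
        # The fallback should hand back the whole raw file.
        return result == source
    finally:
        os.remove(path)
```

Calling `check_fallback_on_extractor_failure()` exercises the fallback path without needing a file that actually breaks the extractor.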

Development

Successfully merging this pull request may close these issues.

TF-IDF agent crashes on real-world projects due to unhandled comment extraction errors
