feat(newagent): Add a new Keyword Agent for pre-checking#109

Open
rajuljha wants to merge 1 commit intofossology:masterfrom
rajuljha:feat/newagent/Keyword
Conversation

@rajuljha rajuljha commented Jul 2, 2025

Description

Introduces a new agent, KeywordAgent, that pre-checks files for the possibility of license text.
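
As a rough illustration of what a keyword pre-check can look like, here is a minimal sketch. It is not the PR's KeywordAgent: `EXAMPLE_KEYWORDS`, `compile_keywords`, and `has_license_keyword` are hypothetical names, and the keyword list is a stand-in for the license_keywords.txt file the PR adds.

```python
import re
from typing import Iterable

# Hypothetical keywords for illustration; the PR ships its own license_keywords.txt.
EXAMPLE_KEYWORDS = [
    "license",
    "copyright",
    "permission is hereby granted",
    "spdx-license-identifier",
]


def compile_keywords(keywords: Iterable[str]) -> re.Pattern:
    """Build one case-insensitive pattern that matches any of the keywords."""
    escaped = [re.escape(k.strip()) for k in keywords if k.strip()]
    if not escaped:
        # An empty alternation would match every input, so refuse it.
        raise ValueError("at least one keyword is required")
    # Longest first so multi-word phrases are tried before their prefixes.
    escaped.sort(key=len, reverse=True)
    return re.compile("|".join(escaped), re.IGNORECASE)


def has_license_keyword(text: str, pattern: re.Pattern) -> bool:
    """Return True if the text contains at least one keyword."""
    return pattern.search(text) is not None
```

A file that fails this kind of check is unlikely to carry license text, which is what makes it usable as a cheap filter in front of a slower agent.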

Changes

  • Add a new KeywordAgent class which handles KeywordAgent logic.
  • Modify licenseDownloader.py to include helper methods for downloading and building the license_refs.csv file
  • Add building license_refs as a step in build_deps.py
  • Add license_keywords.txt and license_refs_combined.csv files
  • Add keyword_eval.py to evaluate KeywordAgent accuracy on NomosTestFiles

How to test

[!NOTE]
If building atarashi, run the build_deps.py script as well.

  1. For running the atarashi agent:
     chmod +x atarashi/atarashii.py
     poetry run atarashi/atarashii.py -a <agent_name> <path_to_file_or_folder>
  2. For running the evaluator script:
     chmod +x atarashi/evaluator/keyword_eval.py
     poetry run atarashi/evaluator/keyword_eval.py

Screenshots

  1. Evaluator results:
     Screenshot 2025-07-02 at 12 09 10 PM
  2. With KeywordAgent being invoked:
     Screenshot 2025-07-02 at 11 42 58 AM
  3. Without KeywordAgent being invoked:
     Screenshot 2025-07-02 at 11 42 35 AM

Member

@Kaushl2208 Kaushl2208 left a comment

Hi @rajuljha,

I've reviewed the changes and recommended a few minor adjustments. Overall, the changes look good at first glance.

I have a question regarding the NomosTestFiles: Is it necessary to commit these files directly? Could we consider downloading them at runtime instead? For instance, if a user wishes to run the evaluator for Keywords, they could download the files when needed, allowing the evaluation to be performed dynamically.

From our perspective, the evaluation script serves as a form of functional testing, which we can utilize in CI stages to ensure everything is functioning correctly. In these scenarios, I believe we could download the TestFiles and execute the evaluator as required, correct?
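
A download-on-demand step along these lines could be sketched as follows. This is only an illustration of the idea: `TEST_FILES_URL` is a placeholder (the thread does not say where the NomosTestFiles archive would be hosted), and `ensure_test_files` and its `fetch` parameter are hypothetical names, not part of atarashi.

```python
import os
import urllib.request
import zipfile
from typing import Callable

# Placeholder only: the real location of the test files is not given in this thread.
TEST_FILES_URL = "https://example.org/NomosTestFiles.zip"


def _download(url: str, dest: str) -> None:
    """Default fetcher: save the URL to a local path."""
    urllib.request.urlretrieve(url, dest)


def ensure_test_files(target_dir: str,
                      url: str = TEST_FILES_URL,
                      fetch: Callable[[str, str], None] = _download) -> str:
    """Download and unpack the test files unless target_dir already has content."""
    if os.path.isdir(target_dir) and os.listdir(target_dir):
        return target_dir  # already present, skip the download
    os.makedirs(target_dir, exist_ok=True)
    archive = os.path.join(target_dir, "testfiles.zip")
    fetch(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
    os.remove(archive)  # keep only the extracted files
    return target_dir
```

The evaluator (or a CI job) would call `ensure_test_files(...)` before scanning, so the files never need to be committed. Passing the fetcher as a parameter keeps the function testable without network access.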

if os.path.isfile(input_path):
    results = agent.scan(input_path)
    if results:
        print(f"Scan results for {input_path}:")
Member

Suggested change
- print(f"Scan results for {input_path}:")
+ print(f"Keyword Scan results for {input_path}:")

file_path = os.path.join(root, file)
results = agent.scan(file_path)
if results:
    print(f"Scan results for {file_path}:")
Member

Same here.

Since it is currently more of a support agent than a license detector, we should mention that explicitly. I am also wondering whether it can act as a standalone agent altogether, or whether it can act as a bypasser for all the other agents. What do you think, @rajuljha?

Author

KeywordAgent can also act independently by running the keyWordAgent.py file directly, though not through the atarashi CLI currently. Inside the atarashi CLI, it only acts as a support agent for now. Let me know if it should be made accessible in the CLI as well.

warnings.simplefilter("ignore")
sklearn_tfidf = TfidfVectorizer(min_df=1, use_idf=True, smooth_idf=True,
                                sublinear_tf=True, tokenizer=tokenize,
                                token_pattern=None,
Member

I am hoping this is for a bug fix??

Author

Yup!

ngram_json = defaultJSON
# Validate compatibility between agent and similarity
if args.agent_name == "tfidf" and args.similarity not in ["CosineSim", "ScoreSim"]:
    print("Error: TFIDF agent supports only CosineSim or ScoreSim.", file=sys.stderr)
Member

Ideally, this warning should be in the agent's help command, right?

Author

I have added this to the help command, but left this as a sanity check for the agent!
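
For reference, stating the constraint in the help text while keeping the runtime check could look roughly like the sketch below. The `-a` flag and the two similarity names come from this thread; the `-s`/`--similarity` flag, its default, and the `build_parser`/`validate` helpers are assumptions made for illustration and may not match atarashi's actual CLI.

```python
import argparse
import sys

# Values taken from the check quoted above.
TFIDF_SIMILARITIES = ("CosineSim", "ScoreSim")


def build_parser() -> argparse.ArgumentParser:
    """Parser whose help text documents the tfidf restriction (flag names assumed)."""
    parser = argparse.ArgumentParser(prog="atarashi")
    parser.add_argument("-a", "--agent_name", required=True,
                        help="name of the agent to run")
    parser.add_argument("-s", "--similarity", default="CosineSim",
                        help="similarity method; the tfidf agent supports only "
                             + " or ".join(TFIDF_SIMILARITIES))
    return parser


def validate(args: argparse.Namespace) -> None:
    """Sanity check kept in addition to the help text."""
    if args.agent_name == "tfidf" and args.similarity not in TFIDF_SIMILARITIES:
        print("Error: TFIDF agent supports only CosineSim or ScoreSim.",
              file=sys.stderr)
        raise SystemExit(1)
```

Keeping both is cheap: the help text tells users up front, and the check catches anyone who did not read it.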

result = {"file": os.path.abspath(inputPath), "results": result}
result = json.dumps(result, sort_keys=True, ensure_ascii=False, indent=4)
print(result + "\n")
keyword_ok = args.skip_keyword or keyword_scanner.scan(inputPath)
Member

Oh, my bad. The Keyword agent is actually working as a support agent here for any type of scan. :D
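
The gating pattern in the quoted line (`args.skip_keyword or keyword_scanner.scan(inputPath)`) can be sketched in isolation as below. `run_scan` and its parameters are hypothetical; what the CLI actually does when the pre-check fails is not shown in this thread, so returning `None` here is an assumption.

```python
from typing import Callable, Optional


def run_scan(path: str,
             keyword_scan: Callable[[str], bool],
             agent_scan: Callable[[str], list],
             skip_keyword: bool = False) -> Optional[list]:
    """Run the main agent only if the keyword pre-check passes or is skipped."""
    # Short-circuit: with skip_keyword set, the keyword scanner is never called.
    keyword_ok = skip_keyword or keyword_scan(path)
    if not keyword_ok:
        return None  # assumed behaviour: no keywords found, main agent not invoked
    return agent_scan(path)
```

Because the check sits in front of whichever agent was selected, it acts as a support step for every scan type rather than as a detector of its own.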

:return: HTTP Pool Manager
"""
- proxy_val = os.environ.get('http_proxy', False)
+ proxy_val = os.environ.get('http_proxy')
Member

Good catch :P
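
One plausible reading of why this change helps (the thread does not spell it out): with a `False` default the variable can hold either a string or a boolean, whereas without a default it is a string or `None`. A small sketch of the second form, using a hypothetical `proxy_from_env` helper:

```python
import os
from typing import Mapping, Optional


def proxy_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the http_proxy value, or None when the variable is unset."""
    # Without a default, get() returns None for a missing key, so callers
    # only ever see a string or None, never a bool.
    return env.get("http_proxy")
```

Callers can then branch on `is None` to decide between a plain pool manager and a proxied one.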

Comment on lines +169 to +175
cpuCount = os.cpu_count()

num_threads = threads if threads is not None else cpuCount
if num_threads is None:
    num_threads = 1 # Fallback
if cpuCount is not None:
    num_threads = min(num_threads, cpuCount * 2)
Member

Can we make this a modular function??

Author

Will do!
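
One way the thread-count logic from the commented lines could be factored out is sketched below. The function name `resolve_thread_count` and the injectable `cpu_count` parameter are assumptions for illustration; the body mirrors the snippet under review.

```python
import os
from typing import Optional


def resolve_thread_count(threads: Optional[int] = None,
                         cpu_count: Optional[int] = None) -> int:
    """Pick a worker count: requested value, else CPU count, else 1, capped at 2x CPUs."""
    if cpu_count is None:
        cpu_count = os.cpu_count()  # may itself be None on some platforms
    num_threads = threads if threads is not None else cpu_count
    if num_threads is None:
        num_threads = 1  # fallback when the CPU count cannot be determined
    if cpu_count is not None:
        num_threads = min(num_threads, cpu_count * 2)
    return num_threads
```

For example, `resolve_thread_count(100, cpu_count=4)` is capped to 8, while `resolve_thread_count(None, cpu_count=4)` falls back to 4. Like the original snippet, this sketch does not reject zero or negative values.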

Author

rajuljha commented Jul 6, 2025

> Hi @rajuljha,
>
> I've reviewed the changes and recommended a few minor adjustments. Overall, the changes look good at first glance.
>
> I have a question regarding the NomosTestFiles: Is it necessary to commit these files directly? Could we consider downloading them at runtime instead? For instance, if a user wishes to run the evaluator for Keywords, they could download the files when needed, allowing the evaluation to be performed dynamically.
>
> From our perspective, the evaluation script serves as a form of functional testing, which we can utilize in CI stages to ensure everything is functioning correctly. In these scenarios, I believe we could download the TestFiles and execute the evaluator as required, correct?

Yeah, we can do that. I'll take a look and fix this!

The Keyword agent pre-checks files for certain keywords for license-possibility
detection. Keywords are taken from Nomos's STRINGS.in, FOSSology's
license_ref.json, and SPDX licenses and exceptions, in a two-stage pipeline.

Signed-off-by: Rajul Jha <rajuljha49@gmail.com>
@rajuljha rajuljha force-pushed the feat/newagent/Keyword branch from 6031e7a to adf3d98 Compare July 19, 2025 08:29
2 participants