feat(SemanticSearch): Added a new semantic search agent that uses fuzzy string matching and Levenshtein distance. #103

Open
Hero2323 wants to merge 2 commits into fossology:master from Hero2323:master

@Hero2323

Added a new Semantic Search Agent that can be used as follows:

atarashi -a SemanticSearch /path/to/file.c
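
For readers unfamiliar with the metric named in the title, here is a minimal sketch of Levenshtein distance and one way to turn it into a 0 to 100 similarity score. It is not the agent's code: the function names `levenshtein` and `similarity` are made up for illustration, and the fuzzy-matching library the agent calls may normalise its score differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the distance to a 0..100 score (100 means identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

A score like this can be computed between a source-file comment and each license text, with the highest-scoring license taken as the candidate match.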

…. The project was not building without this update, using the same package values specified.
…zy string matching and Levenshtein distance.

@Kaushl2208 (Member) left a comment

Changes look good, but this needs tests!

Maybe we can add the agent to the Build and Test stage?

@Kaushl2208 added the "needs test" and "GSoC-2024" (Pull request submitted under Google Summer Of Code 2024) labels on Oct 24, 2024

@Kaushl2208 (Member) left a comment

@Hero2323, I tested it and found some issues :)
Also, please update the README to explain how to use the SemanticSearch agent (the processLicenseList flag, etc.).
Please take a look.

Comment on lines +39 to +40
def __init__(self, licenseList):
    super().__init__(licenseList)

Suggested change
- def __init__(self, licenseList):
-     super().__init__(licenseList)
+ def __init__(self, licenseList, verbose=0):
+     super().__init__(licenseList)

If verbose output is planned: the verbose input flag is defined but never passed to the constructor, so this is prone to throwing an error.
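
A minimal sketch of how the flag could be accepted and forwarded, assuming the base class constructor also takes a `verbose` argument (not confirmed in this thread). `BaseAgent` and `SemanticSearchSketch` are hypothetical stand-ins, not the real atarashi classes.

```python
class BaseAgent:
    """Hypothetical stand-in for the agent base class."""

    def __init__(self, licenseList, verbose=0):
        self.licenseList = licenseList
        self.verbose = verbose


class SemanticSearchSketch(BaseAgent):
    def __init__(self, licenseList, verbose=0):
        # Forward verbose so the base class stores it; without the extra
        # parameter, a caller passing verbose would hit a TypeError.
        super().__init__(licenseList, verbose)


agent = SemanticSearchSketch(licenseList=[], verbose=2)
```

If the real base class does not accept `verbose`, the subclass could store it on `self` directly instead of forwarding it.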

fuzzy_similarity_matrix_2 = np.zeros(len(self.licenseList))
for i in range(len(self.licenseList)):
    fuzzy_similarity_matrix_2[i] = fuzz.ratio(appended_comment, self.licenseList.loc[i, 'text'])
    if pd.notna(licenseList.loc[i, 'license_header']):

Suggested change
- if pd.notna(licenseList.loc[i, 'license_header']):
+ if pd.notna(self.licenseList.loc[i, 'license_header']):

The bare licenseList variable is not accessible inside this method; it has to be read through self.licenseList.
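
To illustrate the fix, here is a small self-contained sketch of a scoring loop that reads everything through `self.licenseList`. It makes several assumptions: the license list is a plain list of dicts rather than a DataFrame, `difflib.SequenceMatcher` is swapped in for the PR's `fuzz.ratio`, and taking the maximum of the text and header scores is only one possible way to combine them. `HeaderScorer` is a hypothetical name.

```python
from difflib import SequenceMatcher


class HeaderScorer:
    def __init__(self, licenseList):
        # Stored on the instance, so methods must use self.licenseList.
        self.licenseList = licenseList

    def scores(self, comment):
        results = []
        for row in self.licenseList:
            text_score = SequenceMatcher(None, comment, row["text"]).ratio()
            header = row.get("license_header")
            # Only score the header when one exists for this license.
            if header is not None:
                header_score = SequenceMatcher(None, comment, header).ratio()
            else:
                header_score = 0.0
            results.append(max(text_score, header_score))
        return results


scorer = HeaderScorer([
    {"text": "abc", "license_header": None},
    {"text": "xyz", "license_header": "abc"},
])
```

Referring to a bare `licenseList` inside `scores` would raise a NameError unless a global of that name happened to exist, which is the issue flagged above.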

args = parser.parse_args()

inputFile = args.inputFile
licenseList = args.processedLicenseList

Can we also make it more reliable? If the user doesn't provide processedLicenseList, the agent should pick up the one we already ship :)

Something like:

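from pkg_resources import resource_filename

# Fall back to the processed license list bundled with atarashi.
defaultProcessed = resource_filename("atarashi",
                                     "data/licenses/processedLicenses.csv")

# licenseList holds args.processedLicenseList from the lines above.
if licenseList is None:
    licenseList = defaultProcessed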
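
The same fallback can be sketched end to end with argparse. This is an illustration under assumptions rather than the agent's actual CLI: the `-l` short flag, the helper names, and the `DEFAULT_PROCESSED` path are hypothetical, and the real agent would resolve the packaged CSV instead of using a relative path.

```python
import argparse

# Hypothetical default; stands in for the packaged processedLicenses.csv.
DEFAULT_PROCESSED = "data/licenses/processedLicenses.csv"


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("inputFile")
    # Optional: stays None when the user does not supply a list.
    parser.add_argument("-l", "--processedLicenseList", default=None)
    return parser


def resolve_license_list(argv):
    args = build_parser().parse_args(argv)
    licenseList = args.processedLicenseList
    if licenseList is None:
        licenseList = DEFAULT_PROCESSED
    return args.inputFile, licenseList
```

Keeping the argparse default as None and resolving it afterwards makes it easy to tell whether the user passed a value explicitly.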

Labels: GSoC-2024 (Pull request submitted under Google Summer Of Code 2024), needs test
