
Feat(models): Implemented three models for license similarity#69

Open
Kaushl2208 wants to merge 1 commit into fossology:master from Kaushl2208:feat/model

Member

@Kaushl2208 Kaushl2208 commented Aug 11, 2020

Description

This PR implements Logistic Regression, Multinomial Naive Bayes, and Linear SVC classifiers trained on the license dataset licenseList.csv. The goal is to find a model that can make atarashi faster and more accurate.
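To make the classification idea concrete, here is a minimal, dependency-free sketch of one of the three approaches, Multinomial Naive Bayes with add-one smoothing over word counts. It is not the code from train.py: the function names, the whitespace tokenizer, and the three toy license snippets are all assumptions made for illustration, and the PR itself may well rely on a machine-learning library instead.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def train_nb(samples):
    """Count documents and words per label from (label, text) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in samples:
        tokens = tokenize(text)
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior for the given text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [tok for tok in tokenize(text) if tok in vocab]
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Log prior plus add-one (Laplace) smoothed log likelihoods.
        score = math.log(doc_counts[label] / total_docs)
        for tok in tokens:
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data; the real training set is licenseList.csv.
samples = [
    ("MIT", "permission is hereby granted free of charge to any person obtaining a copy"),
    ("GPL-2.0", "this program is free software you can redistribute it and/or modify it "
                "under the terms of the gnu general public license"),
    ("Apache-2.0", "licensed under the apache license version 2.0 you may not use this "
                   "file except in compliance"),
]
model = train_nb(samples)
guess = predict_nb(model, "redistribute it under the terms of the gnu general public license")
```

With only one short sample per class this is a toy, but it shows the shape of the problem: each license is a class, and an input file is assigned to the class whose word distribution explains it best.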

Files

  • train.py (trains the models and saves them as binary files)
  • test.py (for testing)
  • lr_model.pkl (binary file for Logistic Regression)
  • nb_model.pkl (binary file for Multinomial Naive Bayes)
  • svc_model.pkl (binary file for Linear SVC)
  • vectorizer.pkl (binary file storing the vocabulary)
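The .pkl files above are binary dumps of trained objects. As a rough sketch of how such a file could be written and read back, the snippet below uses Python's standard pickle module. The helper names and the small dictionary standing in for a vocabulary are assumptions; the actual contents of vectorizer.pkl are not shown in this PR.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifact(obj, path):
    # Serialize any picklable Python object to a binary file.
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_artifact(path):
    # Read the object back; only unpickle files from a trusted source.
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical stand-in for a fitted vocabulary (token -> column index).
vocabulary = {"license": 0, "permission": 1, "warranty": 2}

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = Path(tmp) / "vectorizer.pkl"
    save_artifact(vocabulary, vocab_path)
    restored = load_artifact(vocab_path)
```

Storing the vocabulary next to each model matters because a classifier can only interpret feature vectors built with the same vocabulary it was trained on.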

How to use?

  • Test the models

    • atarashi -a lr_classifier path/to/file (Logistic Regression)
    • atarashi -a nb_classifier path/to/file (Multinomial Naive Bayes)
    • atarashi -a svc_classifier path/to/file (Linear SVC)
  • Train the models (Optional)

    • From the base folder, run: python3 atarashi/agents/models/train.py
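One way the agent names above could map onto the saved model files is sketched below. This is not the actual atarashii.py: the parser, the MODEL_FILES table, and the resolve function are hypothetical and only reuse the agent names and file names listed in this PR.

```python
import argparse

# Assumed mapping from the agent names in this PR to the model files it lists.
MODEL_FILES = {
    "lr_classifier": "lr_model.pkl",
    "nb_classifier": "nb_model.pkl",
    "svc_classifier": "svc_model.pkl",
}


def build_parser():
    # Hypothetical stand-in for the real command-line interface.
    parser = argparse.ArgumentParser(prog="atarashi-sketch")
    parser.add_argument("-a", dest="agent", required=True,
                        choices=sorted(MODEL_FILES),
                        help="which classifier agent to run")
    parser.add_argument("input_file", help="path to the file to scan")
    return parser


def resolve(argv):
    """Return (model file name, input path) for the given arguments."""
    args = build_parser().parse_args(argv)
    return MODEL_FILES[args.agent], args.input_file


selected = resolve(["-a", "svc_classifier", "path/to/file"])
```

Restricting the agent argument with choices means an unknown agent name is rejected by the parser before any model file is loaded.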

ToDo

  • Test that the algorithms work and measure their accuracy using evaluator.py

  • Properly integrate with atarashii.py

Accuracy Score

| Model Name | Accuracy Score (%) | Time taken on 100 files (sec) |
| --- | --- | --- |
| Logistic Regression | 31 | 88.6 |
| Linear SVC | 36 | 79.4 |
| Multinomial Naive Bayes | 30 | 83.72 |
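For comparison, the figures above can be turned into a per-file time and a simple ranking. The snippet is only arithmetic on the reported numbers; the variable names are made up for this sketch.

```python
# (accuracy in %, total seconds for 100 files), as reported in the table.
results = {
    "Logistic Regression": (31, 88.6),
    "Linear SVC": (36, 79.4),
    "Multinomial Naive Bayes": (30, 83.72),
}
FILES_SCANNED = 100

# Average seconds spent per file for each model.
seconds_per_file = {
    name: total_seconds / FILES_SCANNED
    for name, (_, total_seconds) in results.items()
}

# Model with the highest accuracy and model with the lowest total time.
most_accurate = max(results, key=lambda name: results[name][0])
fastest = min(results, key=lambda name: results[name][1])
```

On these numbers Linear SVC is both the most accurate and the fastest of the three, at a little under 0.8 seconds per file.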

Future Scope

  • A better-defined dataset should improve the similarity accuracy further. By better-defined I mean one that also covers newly updated licenses; a license file in a "1 class to n licenses" style would do the job.

CC: @hastagAB @GMishx @ag4ums

Signed-off-by: Kaushlendra Pratap Singh <kaushlendrapratap.9837@gmail.com>

@Kaushl2208
Member Author

@hastagAB @GMishx, I added the models command to atarashii.py, but it seems I am missing an update somewhere else in the code.

@Kaushl2208
Member Author

@GMishx @ag4ums I have run all three models on the test files and am attaching screenshots of the results.

SVC

[Screenshot: SVC results]

NB

[Screenshot: NB results]

Logistic Regression

[Screenshot: LR results]

@GMishx GMishx added the GSOC-20 label Aug 20, 2020

def model_train():
    data = pd.read_csv("atarashi/data/licenses/licenseList.csv")
Member

As a future improvement, SPDX license data can be pulled in using atarashi.license.licenseDownloader.LicenseDownloader.download_license and merged with the main list using atarashi.license.license_merger.license_merger.

Member

@GMishx GMishx left a comment

A few more changes are required. Also, please squash your commits.

Member

@GMishx GMishx left a comment

Agent looks good.
Tested with pip install .

Labels

Ready (PR ready for merge)
