Hello.
I've been playing around with some parameters of the TF-IDF agent.
I've found that if we stop using a threshold (`cosine similarity >= 0.30`) to filter the match results, the accuracy improves by up to 3 points. However, filtering helps reduce the compute time, since the results are sorted at the end of the search. See the piece of code I am talking about (especially lines 126 and 133), from `atarashi/agents/tfidf.py`, lines 124 to 136 at commit `6cdd410`:
```python
for counter, value in enumerate(all_documents_matrix, start=0):
    sim_score = self.__cosine_similarity(value, search_martix)
    if sim_score >= 0.3:
        matches.append({
            'shortname': self.licenseList.iloc[counter]['shortname'],
            'sim_type': "TF-IDF Cosine Sim",
            'sim_score': sim_score,
            'desc': ''
        })
matches.sort(key=lambda x: x['sim_score'], reverse=True)
if self.verbose > 0:
    print("time taken is " + str(time.time() - startTime) + " sec")
return matches
```
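The private helper `self.__cosine_similarity` is not shown in the snippet; as a point of reference, here is a minimal sketch of what such a helper typically computes (assuming dense NumPy vectors; the actual implementation in `tfidf.py` may differ):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||); returns 0.0 if either vector is all zeros
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0
```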
Using the `evaluation.py` script, I've carried out some experiments:
| # | Algorithm | Time elapsed (s) | Accuracy |
|---|-----------|------------------|----------|
| 1 | tfidf (CosineSim) (thr=0.30) | 30.19 | 59.0% |
| 2 | tfidf (CosineSim) (thr=0.17) | 35.29 | 61.0% |
| 3 | tfidf (CosineSim) (thr=0.16, max_df=0.10) | 27.34 | 62.0% |
| 4 | tfidf (CosineSim) (thr=0.16) | 36.42 | 62.0% |
| 5 | tfidf (CosineSim) (thr=0.15) | 38.45 | 62.0% |
| 6 | tfidf (CosineSim) (thr=0.10) | 39.91 | 62.0% |
| 7 | tfidf (CosineSim) (thr=0.00) | 61.49 | 62.0% |
| 8 | Ngram (CosineSim) | - | 57.0% |
| 9 | Ngram (BigramCosineSim) | - | 56.0% |
| 10 | Ngram (DiceSim) | - | 55.0% |
| 11 | wordFrequencySimilarity | - | 23.0% |
| 12 | DLD | - | 17.0% |
| 13 | tfidf (ScoreSim) | - | 13.0% |
- Row 1 shows the performance (speed and accuracy) of the current configuration of the TF-IDF agent, using CosineSim as the similarity measure.
- Row 7 shows that we can reach an accuracy of 62.0% just by removing the threshold (`cosine similarity >= 0.00`). However, removing the threshold alone makes the agent 2x slower, so I continued tuning the threshold, keeping the largest value that still produces 62.0% accuracy, which is `0.16`, shown in row 4.
- To further decrease the execution time and increase the accuracy, I tuned some parameters of the TfidfVectorizer. Setting `max_df` to `0.10` (the default is `1.0`) keeps the accuracy at 62.0% while making the agent 1.1x faster, shown in row 3.
- Why does decreasing the `max_df` value increase the speed? Because the vectorizer ignores all terms that appear in more than `max_df` percent of the documents (see docs), i.e., it drops the most frequent terms, so each document vector has fewer dimensions and the cosine similarity is cheaper to compute.
- Why does decreasing the `max_df` value keep the accuracy high? My explanation is that terms appearing in most licenses do not help the algorithm distinguish between licenses; rare terms are what set licenses apart, so they are enough for the algorithm to do a good job.
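The threshold trade-off can be illustrated with a toy sketch (random numbers stand in for real cosine similarities, and `match_count` is a hypothetical helper, not part of the agent): lowering the threshold keeps more candidates, so more match dicts are built and the final sort runs over a longer list.

```python
import random

random.seed(0)
# stand-in similarity scores; the real agent computes one per license document
scores = [random.random() for _ in range(100_000)]

def match_count(threshold):
    # number of candidates the loop would append (and later sort)
    return sum(1 for s in scores if s >= threshold)

# a threshold of 0.00 keeps every candidate; higher thresholds prune the list
assert match_count(0.00) == len(scores)
assert match_count(0.30) < match_count(0.16) < match_count(0.00)
```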
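The `max_df` effect can also be observed directly on a toy corpus (the documents below are made-up license-like snippets, not real data): terms whose document frequency exceeds `max_df` are dropped from the vocabulary, so the TF-IDF matrix ends up with fewer columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up license-like snippets; "permission" appears in every document
docs = [
    "permission is hereby granted free of charge",
    "permission to use copy modify and distribute",
    "permission is granted to redistribute the software",
    "this permission notice shall cover patent claims",
]

full = TfidfVectorizer().fit(docs)
trimmed = TfidfVectorizer(max_df=0.75).fit(docs)  # drop terms with df > 75%

# "permission" (df = 100%) is pruned; rarer, more discriminative terms survive
print(len(full.vocabulary_), len(trimmed.vocabulary_))
```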
I will open a PR so you can reproduce the results in row 3 and merge the changes if you consider them relevant.
Important notes:
- I've left out the speed times for the other algorithms because I ran those experiments in a different context, so a time comparison wouldn't be fair.
- All the results differ from the last report I could find. I do not fully understand why some of them are so different; probably changes in the test files or in the algorithms. In any case, 62.0% is the new best result in both reports.
- My findings may help improve other agents that use thresholds, such as Ngram.
- This new state-of-atarashi performance 😅 may also raise the bar for future agent implementations, since it would be the new baseline.