Hello.
I've been playing around with some parameters of the TF-IDF agent.
I've found that if we stop using a threshold (`cosine similarity >= 0.30`) to filter the match results, the accuracy improves by up to 3 points. However, filtering helps reduce the compute time, since the results are sorted at the end of the search. See the piece of code I am talking about (especially lines 126 and 133), from `atarashi/agents/tfidf.py`, lines 124 to 136 at commit `6cdd410`:
```python
for counter, value in enumerate(all_documents_matrix, start=0):
    sim_score = self.__cosine_similarity(value, search_martix)
    if sim_score >= 0.3:
        matches.append({
            'shortname': self.licenseList.iloc[counter]['shortname'],
            'sim_type': "TF-IDF Cosine Sim",
            'sim_score': sim_score,
            'desc': ''
        })
matches.sort(key=lambda x: x['sim_score'], reverse=True)
if self.verbose > 0:
    print("time taken is " + str(time.time() - startTime) + " sec")
return matches
```
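The private helper `self.__cosine_similarity` is not shown in the snippet; as a point of reference, here is a minimal sketch of what such a helper typically computes (assuming dense NumPy vectors; the actual implementation in `tfidf.py` may differ):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||); returns 0.0 if either vector is all zeros
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0
```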
Using the `evaluation.py` script, I've carried out some experiments:
| # | Algorithm | Time elapsed (s) | Accuracy |
|---|-----------|------------------|----------|
| 1 | tfidf (CosineSim) (thr=0.30) | 30.19 | 59.0% |
| 2 | tfidf (CosineSim) (thr=0.17) | 35.29 | 61.0% |
| 3 | tfidf (CosineSim) (thr=0.16, max_df=0.10) | 27.34 | 62.0% |
| 4 | tfidf (CosineSim) (thr=0.16) | 36.42 | 62.0% |
| 5 | tfidf (CosineSim) (thr=0.15) | 38.45 | 62.0% |
| 6 | tfidf (CosineSim) (thr=0.10) | 39.91 | 62.0% |
| 7 | tfidf (CosineSim) (thr=0.00) | 61.49 | 62.0% |
| 8 | Ngram (CosineSim) | - | 57.0% |
| 9 | Ngram (BigramCosineSim) | - | 56.0% |
| 10 | Ngram (DiceSim) | - | 55.0% |
| 11 | wordFrequencySimilarity | - | 23.0% |
| 12 | DLD | - | 17.0% |
| 13 | tfidf (ScoreSim) | - | 13.0% |
- Row 1 shows the performance (speed and accuracy) of the current configuration of the TF-IDF agent, using CosineSim as the similarity measure.
- Row 7 shows that we can reach an accuracy of 62.0% just by removing the threshold (`cosine similarity >= 0.00`). However, removing the threshold alone makes the agent 2x slower, so I continued tuning the threshold, keeping the largest value that still produces 62.0% accuracy, which is `0.16`, shown in row 4.
- To further decrease the execution time and increase the accuracy, I tuned some parameters of the TfidfVectorizer. Setting `max_df` to `0.10` (the default is `1.0`) keeps the accuracy at 62.0% while making the agent 1.1x faster, shown in row 3.
- Why does decreasing the `max_df` value increase the speed? Because the vectorizer ignores all terms that appear in more than `max_df` percent of the documents (see docs), i.e., it drops the most frequent terms, so each document vector has fewer dimensions and the cosine similarity is cheaper to compute.
- Why does decreasing the `max_df` value keep the accuracy high? My explanation is that terms appearing in most licenses do not help the algorithm distinguish between licenses; rare terms are what set licenses apart, so they are enough for the algorithm to do a good job.
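The threshold trade-off can be illustrated with a toy sketch (random numbers stand in for real cosine similarities, and `match_count` is a hypothetical helper, not part of the agent): lowering the threshold keeps more candidates, so more match dicts are built and the final sort runs over a longer list.

```python
import random

random.seed(0)
# stand-in similarity scores; the real agent computes one per license document
scores = [random.random() for _ in range(100_000)]

def match_count(threshold):
    # number of candidates the loop would append (and later sort)
    return sum(1 for s in scores if s >= threshold)

# a threshold of 0.00 keeps every candidate; higher thresholds prune the list
assert match_count(0.00) == len(scores)
assert match_count(0.30) < match_count(0.16) < match_count(0.00)
```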
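The `max_df` effect can also be observed directly on a toy corpus (the documents below are made-up license-like snippets, not real data): terms whose document frequency exceeds `max_df` are dropped from the vocabulary, so the TF-IDF matrix ends up with fewer columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up license-like snippets; "permission" appears in every document
docs = [
    "permission is hereby granted free of charge",
    "permission to use copy modify and distribute",
    "permission is granted to redistribute the software",
    "this permission notice shall cover patent claims",
]

full = TfidfVectorizer().fit(docs)
trimmed = TfidfVectorizer(max_df=0.75).fit(docs)  # drop terms with df > 75%

# "permission" (df = 100%) is pruned; rarer, more discriminative terms survive
print(len(full.vocabulary_), len(trimmed.vocabulary_))
```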
I will open a PR so you can reproduce the results in row 3 and merge the changes if you consider them relevant.
Important notes:
- I've left out the speed times for the other algorithms because I ran those experiments in a different context, so a time comparison wouldn't be fair.
- All the results differ from the last report I could find. I do not fully understand why some of them are so different; probably changes in the test files or in the algorithms. In any case, 62.0% is the new best result in both reports.
- My findings may help improve other agents that use thresholds, such as Ngram.
- This new state-of-atarashi performance 😅 may also raise the bar for future agent implementations, since it would be the new baseline.