Commit ccb168d

author
Angelogeb
committed
Fixed small errors and added info
1 parent 532e0b4 commit ccb168d

3 files changed

Lines changed: 13 additions & 3 deletions

hw2/README.md

Lines changed: 10 additions & 0 deletions
@@ -1,5 +1,10 @@
 # Data Mining Homework 2
 
+# Report
+
+The report can be found in `report.html`. It uses MathJax from a CDN therefore
+in case of missing network connection the formulas will be unreadable.
+
 # Problem 1
 
 The program requires flask as a dependency otherwise just run the
@@ -20,3 +25,8 @@ See usage section and examples in report.html
 $ cd p3
 $ ./spark_lsh.py
 ```
+
+# Notes
+
+Problem 2 and 3 require the mmh3 package although they provide a plain python
+implementation

hw2/report.html

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ <h3 id="conclusions-and-instructions-to-run">Conclusions and instructions to run
 <p>The system is written in <code>python3</code> and presented through a web interface served by a Flask server. To launch it, go in the <code>p1/</code> folder and execute <code>server.py</code>. After some seconds a new tab will be opened in the web browser with the interface to the tool.</p>
 <p><img src="images/q1.png" alt="First" /> <img src="images/q2.png" /></p>
 <p>The decisions made work well overall. Some examples of queries and results are shown above.</p>
-<p>In the first query a well known issue of cosine-similarity is shown: shorter documents get higher ranking since normalization penalizes longer documents. Indeed even if the first document contains only one of the query terms, still achieves a higher score than a longer document containing both terms.</p>
+<p>In the first query a well known issue of cosine-similarity is shown: shorter documents get higher ranking since normalization penalizes longer documents. Indeed even if the first document contains fewer occurrences of the query terms, still achieves a higher score than a longer document containing more occurrences of the terms.</p>
 <p>In the second example the query issued is the first document retrieved. As shown in this case the score achieved by the first document is way higher than the average cases given that the query is the document itself. Moreover also the processing time increases drastically.</p>
 <h2 id="problem-2">Problem 2</h2>
 <p>For this problem the <code>preprocessed_announcements.tsv</code> file produced by the preprocessing phase of the previous problem is used.</p>

hw2/report.txt

Lines changed: 2 additions & 2 deletions
@@ -91,8 +91,8 @@ queries and results are shown above.
 In the first query a well known issue of cosine-similarity is shown:
 shorter documents get higher ranking since normalization penalizes longer
 documents. Indeed even if the first document
-contains only one of the query terms, still achieves a higher score than
-a longer document containing both terms.
+contains fewer occurrences of the query terms, still achieves a higher score
+than a longer document containing more occurrences of the terms.
 
 In the second example the query issued is the first document retrieved.
 As shown in this case the score achieved by the first document is way
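
Note on the reworded sentence: the length-normalization effect it describes can be reproduced with a small sketch. This is not the code from `p1/`. It assumes raw term counts as weights (the actual system may weight terms differently, for example with TF-IDF), and the function name and the toy documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def cosine_similarity(query_tf, doc_tf):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(weight * doc_tf.get(term, 0) for term, weight in query_tf.items())
    query_norm = math.sqrt(sum(w * w for w in query_tf.values()))
    doc_norm = math.sqrt(sum(w * w for w in doc_tf.values()))
    if query_norm == 0 or doc_norm == 0:
        return 0.0
    return dot / (query_norm * doc_norm)


# Hypothetical toy data: a two-term query, a one-word document, and a longer
# document that contains more occurrences of the query terms plus other words.
query = Counter("data mining".split())
short_doc = Counter("data".split())
long_doc = Counter(("data mining data mining " + "spark flask hashing " * 5).split())

short_score = cosine_similarity(query, short_doc)
long_score = cosine_similarity(query, long_doc)

print(f"short document: {short_score:.3f}")
print(f"long document:  {long_score:.3f}")
```

Under these assumptions the one-word document scores higher than the longer one, even though the longer one has more occurrences of the query terms: the extra non-query words inflate its norm and pull the cosine down.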